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Abstract 

This paper develops a methodology for robust Bayesian inference through the use of disparities. Met- 
rics such as Hellinger distance and negative exponential disparity have a long history in robust estimation 
in frequentist inference. We demonstrate that an equivalent robustification may be made in Bayesian 
inference by substituting an appropriately scaled disparity for the log likelihood to which standard Monte 
Carlo Markov Chain methods may be applied. A particularly appealing property of minimum-disparity 
methods is that while they yield robustness with a breakdown point of 1/2, the resulting parameter es- 
timates are also efficient when the posited probabilistic model is correct. We demonstrate that a similar 
property holds for disparity-based Bayesian inference. We further show that in the Bayesian setting, it 
is also possible to extend these methods to robustify regression models, random effects distributions and 
other hierarchical models. The methods are demonstrated on real world data. 



1 Introduction 



In this paper we develop a new methodology for providing robust inference in a Bayesian context. When 
the data at hand are suspected of being contaminated with large outliers it is standard practice to account 
for these either (1) by postulating a heavy-tailed distribution, (2) by viewing the data as a mixture, with 
the contamination explicitly occurring as a mixture component or (3) by employing priors that penalize 
large values of a parameter (see Berger 1994 Albert 2009 Andrade and O'Hagan 2006). In the context of 
frequentist inference, these issues are investigated using methods such as M-estimation, R-estimation etc. and 



are part of standard robustness literature (see ,IIampel et al. 1986 Maronna et al.[ 2006 ?? Jur I . As is the 



case for Huberized loss functions in frequentist inference, even though these approaches provide robustness 
they lead to a loss of precision when contamination is not present in (1) and (2) above or to a distortion of 
prior knowledge in (3). Explicit modeling of outliers as in (2) also requires knoweldge of outlier configurations 
- how many mixture components to use and what distributions to use them in, for example - and may not 
be robust if these are incorrect. This paper develops an alternative systematic Bayesian approach, based on 
disparity theory, that is shown to provide robust inference without loss of efficiency for large samples. 

In parametric frequentist inference using independent and identically distributed (i.i.d.) data, several 
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demonstrated that the dual goal of efficiency and robustness is achievable by using the minimum Hellinger 
distance estimator (MHDE). In the i.i.d. context, MHDE estimators are defined by minimizing the Hellinger 
distance between a postulated parametric density fei') and a non-parametric estimate gn{') over the p- 
dimensional parameter space Q; that is, 



arg inf 

0ee 



Typically, for continuous data, gn{') is taken to be a kernel density estimate; if the probability model is 



supported on discrete values, the empirical distribution is used. More generally, Lindsay ( 1994 ) introduced 



the concept of a minimum disparity procedure; developing a class of divergence measures that have similar 



properties to minimum Hellinger distance estimates. These have been further developed in Basu et al. ( 1997) 
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and Park and Basu (2004 1. Hooker and Vidyashankar (2010a) have extended these methods to a regression 
framework. 



A remarkable property of disparity-based estimates is that while they confer robustness, they are also 
first-order efficient. That is, they obtain the information bound when the postulated density fei-) is correct. 
In this paper we develop robust Bayesian inference using disparities. We show that appropriately scaled 
disparities approximate n times the negative log-likelihood near the true parameter values. We use this as a 
motivation to replace the log likelihood in Bayes rule with a disparity to create what we refer to as the "D- 
posterior" . We demonstrate that this technique is readily amenable to Markov Chain Monte Carlo (MCMC) 
estimation methods. Finally, we establish that the expectation of the D-posterior is asymptotically efficient 
and the resulting credible intervals provide asymptotically accurate coverage when the proposed parametric 
model is correct. 

Disparity-based robustification in Bayesian inference can be naturally extended to a regression framework 
through the use of conditional density estimation as discussed in Hooker and Vidyashankar] (2010a). We 
pursue this extension to hierarchical models and replace various terms in the hierarchy with disparities. This 
creates a novel "plug-in procedure" ~ allowing the robustification of inference with respect to particular 
distributional assumptions in complex models. We develop this principle and demonstrate its utility on a 
number of examples. The use of a disparity within a Bayesian context imposes an additional computational 
burden through the estimation of a kernel density estimate and the need to run MCMC methods. Our analysis 
and simulations demonstrate that while the use of MCMC significantly increases computational costs, the 
additional cost of the use of disparities is on the order of a factor between 2 and 10, remaining implementable 
for many applications. These methods require marginalization of an exponentiated disparity with respect to 
the random effects distribution; a task that can be achieved through MCMC methods, but would otherwise 
be numerically challenging. 



The use of divergence measures for outlier analysis in a Bayesian context has been considered in |Dey and 
Birmiwal ( 1994 ) and Peng and Dey ( 1995 1 . Most of this work is concerned with the use of divergence measures 



to study Bayesian robustness when the priors are contaminated and to diagnose the effect of outliers. These 



divergence measures are computed using MCMC techniques. More recently, Zhan and Hettmansperger (2007) 



and Szpiro et al. (20101 have developed analogues of R-estimates and Bayesian Sandwich estimators. These 
methods can be viewed to be extensions of robust frequentist methods to Bayesian context. By contrast, 
our paper is based on explicitly replacing the likelihood with a disparity in order to provide a systematic 
approach to obtain inherently robust and efficient inference. 

Within the context of Bayesian analysis, robustness has been studied with respect to the specification of 
both prior and data distributions. Robustness to outliers as studied in the frequentist literature is referred 



O'Hagan (1990), Choy and Smith (1997) and Desgagne and Angers (2007). Here, outlier rejection indicates 
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that as some group of data is moved to infinity, the posterior reverts to the posterior without those obser- 
vations. This corresponds to a breakdown point of 1; a rather extreme value for frequentist robustness. We 
also obtain this breakdown point, but additionally develop a notion of an asymptotic breakdown point in 
which we examine the worst-case displacement as sample-size increases. We are able to show that this notion 
effectively describes robustness and distinguishes Bayesian methods along with regularized versions of robust 
estimators from estimators that are trivially made robust by, for example, threshholding their estimates. 

The remainder of the paper is structured as follows: we provide a formal definition of the disparities 
in Section [2] Disparity-based Bayesian inference are developed in Section |3] Robustness and efficiency of 
these estimates are demonstrated theoretically and through a simulation for i.i.d. data in Section [5] The 
methodology is extended to regression models in Section |6] The plug-in procedure is presented in Section 
[7] through an application to a one-way random-effects model. Section |8] is devoted to two real-world data 
sets where we apply these methods to generalized linear mixed models and a random-slope random-intercept 
models for longitudinal data. Proofs of technical results and details of simulation studies are relegated to an 
online appendix. 
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2 Disparities and Their Numerical Approximations 



In this section we describe a class of disparities and numerical procedures for evaluating them. These 
disparities compare a proposed parametric family of densities to a non-parametric density estimate. We 
assume that we have i.i.d. observations Xi for i = 1, . . . ,n from some density h{-). We let (/„ be the kernel 
density estimate: 

g^{x)^^±K('^) (2) 

2—1 ^ ' 

where the kernel K density and c„ is a bandwidth for the kernel. If c„ — >■ and nc„ — )■ cx) it is known 



that gn{-) is an Li-consistent estimator of h{-) (Devroye and Gyorfi 1985). In practice, a number of plug-in 
bandwidth choices are available for c„ (e.g. Silverman 1982 Sheather and Jones 1991 Engel et al. 1994). 
For non-i.i.d. data examined in Sections [6 and [7j plug-in bandwidths can be calculated from method of 
moments estimates. We have found our results to be insensitive to the choice of plug-in bandwidth selector. 



We begin by reviewing the class of disparities described in Lindsay ( 1994 ) . The definition of disparities 
involves the residual function, 



gjx) - fe{x) 
fe{x) ' 



(3) 



defined on the support of fe{x) and a function G : [— 1, oo) 7?.. G(-) is assumed to be strictly convex and 
thrice differentiable with G(0) — 0, G'(0) — and G"(0) = 1. The disparity between fg and g„ is defined to 
be 



DignJe) 



n 



G{8g^g^{x))fe{x)dx. 



(4) 



An estimate of 9 obtained by minimizing Q is called a minimum disparity estimator. Under differentiability 
assumptions, this is equivalent to solving the equation 



A{dg{x))Vefeix)dx = 0, 



where A{S) = G{6) — (1 + S)G'{S) and indicates the derivative with respect to 9. 

This framework contains Kullback-Leibler divergence as approximation to the likelihood: 

f 1 " 

KL{gn,fe) = - / (log/e(a;)) g„(x)da; « -- ^ log/e(xj) 

2 — 1 

for the choice G{d) ~ {5 + 1) log((5 + 1) up to a constant. The squared Hellinger disparity (HD) corresponds 
to the choice G{x) = [{x + 1)^/^ — 1]^ — 1. While robust statistics is typically concerned with the impact of 
outliers, the alternate problem of inliers - defined as nominally-dense regions that lack empirical data and 
consequently small values of 5e,g„ {x) - can also cause instability. It has been illustrated in the literature that 
HD down weighs the effect of large values of ^e.g,, {x) (outliers) relative to the likelihood but magnifies the 
effect of inliers. An alternative, the negative exponential disparity, based on the choice G{x) = — 1 down 
weighs the effect of both outliers and inliers. 

The integrals involved in Q are not analytically tractable and the use of Monte Carlo integration to 
approximate the objective function has been suggested in Cheng and Vidyashankar (2006). More specifically, 
if zi. 



,zjs[ are i.i.d. random samples generated from gni'), one can approximate D{gm fe) by 



1 ^ 

DignJe)^^Y.(^(^' 



9n{Zi)' 



(5) 



3 



The Zi can be efficiently generated in the form Zi = c^Wi + Xjq^ for Wi a random variable generated 
according to K and Ni sampled uniformly from the integers 1,...,7V. In the specific case of Hellinger 
distance approximation, the above reduces to 

1=1 yn [Zi) 

The use of a fixed set of Monte Carlo samples from 5,1 (•) when optimizing for 6 provides a stochastic approx- 
imation to an objective function that remains a smooth function of 9 and hence avoids the need for complex 
stochastic optimization. Similarly, in the present paper, we hold the Zi constant when applying MCMC 
methods to generate samples from the posterior distribution in order to improve their mixing properties. If 
fe is Gaussian with 9 — (/i, cr), Gauss- Hermite quadrature rules can be used to avoid Monte Carlo integration, 
leading to improved computational efficiency in some circumstances. In this case we have 

1=1 

where the ^i{9) and Wi{9) are the points and weights for a Gauss-Hermite quadrature scheme for parameters 
9 = (/i, cr). The choice between ([5| and ([6| depends on the disparity and the problem under investigation. 
When gn{-) has many local modes, ([6]) can result in choosing parameters for which some quadrature point 
coincides with a local modes. However, ([s]) can be rendered unstable by the factor fei'^i) / 9n{zi) for 9 far from 
the maximizing value of D{gn,fe)- In general, we have found ([5| preferable when using Hellinger distance, 
but that ^ performs better with negative exponential disparity. The relative com put ational cost of using 
D{gn, fe) versus D{gn, fe) in various circumstances is discussed in Online Appendix |D[ 



3 The D-Posterior and MCMC Methods 



We begin this section by a heuristic description of the second-order approximation of KL{fg, gn) by 
D{fg,gn). A Taylor expansion of KL{fg, gn) about 9 has the following first two terms: 



VlKL{gnJe) 
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le{x) 



{Vefe{x)){Vefe{x)) -Vlfeix) 



Ve/0(x)\ (Vefeix) 



feix) 



feix) 



^^Ifejx) 
feix) 



i^e,g„{x) + l)dx. 
gn{x)dx 



(7) 



where the second term approximates the observed Fisher Information when the bandwidth is small. The 
equivalent terms for D(gn, fe) are: 



^iDignJe) 



Vlfe{x)A{5e,gJx))dx 
1 



(8) 



fe{x) 



Vefeix)) (Vefeix)) (Se.g^x) + 1)A' iSg^g,Xx))dx. 



Now, if gn is consistent, <5e,g„(a;) — >■ almost surely (a.s.). Observing that A{0) ~ 0, A'{0) — —1 from the 
conditions on G and observing / Vgfe{x)dx = 0, we obtain the equality of Q and (|8|. The fact that these 
heuristics yield efficiency was first noticed by Beran (1977) (eq. 1.1). 



In the context of Bayesian methods, inference is based on the posterior 

P{x\9)tt{9) 



P(9\x) 



J P{x\9)n{9)d9' 



(9) 
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where P{x\9) — exp(^"^-^ log fg{xi)) and tt a prior density which we assume has a first moment. Following 
the heuristics above, in this paper we propose the simple expedient of replacing the log likelihood, logP{x\0), 
in ^ with a disparity: 

In the case of Hellinger distance, the appropriate disparity is 2HD^{gn, fg) and we refer to the resulting 
quantity as the H-posterior. When D(g„, fe) is based on Negative Exponential disparity, we refer to it as 
N-posterior, and D-posterior more generally. These choices are illustrated in Figure [l] where we show the 
approximation of the log likelihood by Hellinger and negative exponential disparities and the effect of adding 
an outlier to these in a simple normal-mean example. 

Throughout the examples below, we employ a Metropolis algorithm based on a symmetric random walk 
to draw samples from PD(6'|g„). While the cost of evaluating D{gn, fe) is greater than the cost of evaluating 
the likelihood at each Metropolis step, we have found these algorithms to be computationally feasible and 
numerically stable. Furthermore, the burn-in period for sampling from Pd(0|(7„) and the posterior are 
approximately the same, although the acceptance rate of the former is around ten percent higher. 

After substituting —nD{gn, fe) for the log likelihood, it will be useful to define summary statistics of 



the Z)-posterior in order to demonstrate their asymptotic properties. Since the ZJ-posterior ( 10 ) is a proper 



probability distribution, the Expected D-a posteriori (EDAP) estimates exist and are given by 

ePDio\gn)de. 



e 



and credible intervals for 9 can be based on the quantiles of Pi:i{6\gn). These quantities are calculated via 
Monte Carlo integration using the output from the Metropolis algorithm. We similarly define the Maximum 
D-a posteriori (MDAP) estimates by 

0+ = arg max Pd (6*1511) ■ 
eee 

In the next section we describe the asymptotic properties of EDAP and MDAP estimators. In particular, 
we establish the posterior consistency, posterior asymptotic normality and efficiency of these estimators and 
their robustness properties. Differences between Pd{0, gn) and the posterior do exist and are described below: 

1. The disparities D{gn, fe) have strict upper bounds; in the case of Helhnger distance < HD^{gn, fe) < 
2, the upper bound for negative exponential disparity is e. This implies that the likelihood part of 
the D-posterior, exp(— nZ)(g„, /g)), is bounded away from zero. Consequently, a proper prior 7r(0) is 
required in order to normalize Poi^^lgn)- A random 6 from Tr{0) must also have finite expectation in 
order for the EDAP to be defined. In particular, uniform priors on unbounded ranges, along with 
most reference priors, cannot be employed here. Further, the tails of PD(0|g„) are proportional to 
that of tt{9). As a consequence, the breakdown point for the EDAP, as traditionally defined, is 1. 
Although note that in Section |4] we propose an modified definition of breakdown which is appropriate 
for regularized and Bayesian estimators under which EDAP has a breakdown of 1/2. 

These results do not affect the asymptotic behavior of Pd (^ISn) since the lower bounds decrease with n. 
Modified D-posteriors based on a transformation m{D{gn, fe j) that removes the upper bound can be 
defined without affecting either the efficiency or robustness of the resulting EDAP estimates. However, 
appropriate transformations m will depend on the the parametric family fe and are beyond the scope 
of this paper. 

2. In Bayesian inference for i.i.d. random variables, the log likelihood is a sum of n terms. This implies 
that if new data Xn+i, . . . , X„* are obtained, the posterior for the combined data Xi, . . . , Xn* can be 
obtained by using posterior after n observations, P{9\Xi, . . . ,X„) as a prior 9: 



P{9\Xi ) cx P(X„+i, . . . , \9)P{9\Xi, . . . , X„). 
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Figure 1: Left: A comparison of log posteriors for /i with data generated from iV(/i, 1) with n = 1 using an 
A^(0, 1) prior for /i. Middle: influence of an outlier on expected D-a posteriori (EDAP) estimates of ^ as the 
value of the outlier is changed from to 20. Right: influence of the prior as the prior mean is is changed 
from to -10. 



By contrast, D{gn, fe) is generally not additive in g„; hence Pd(6'|5„) cannot be factored as above. 
Extending arguments in Park and Basu (2004), we conjecture that no disparity that is additive in g„ 



will yield both robust and efficient posteriors. 

3. While we have found that the same Metropolis algorithms can be effectively used for the D-posterior 
as would be used for the posterior, it is not possible to use conjugate priors with disparities. This 
removes the possibility of using conjugacy to provide efficient sampling methods within a Gibbs sampler, 
although these could be approximated by combining sampling from a conditional distribution with a 
rejection step. 



The idea of replacing log likelihood in the posterior with an alternative criterion occurs in other settings. 
See Sollich (2002), for example, in developing Bayesian methods for support vector machines. However, we 



replace the log likelihood with an approximation that is explicitly designed to be both robust and efficient, 
rather than as a convenient sampling tool for a non-probabilistic model. 



4 Robustness 

The appeal of disparity-based methods is that in addition to the statistical efficiency of the estimators 
defined above when the parametric model is correctly specified, these estimators are also robust to contam- 
ination by data taking large values. As may be expected from the results above, EDAP estimators behave 
similarly to their minimum-disparity counterparts at finite levels of contamination at large but finite values. 
However, classical measures of robustness - influence functions and breakdown points - are based at limiting 
values, either of infinitesimal contamination levels or contaminating values at infinity where the convergence 
between EDAP and minimum-disparity estimators fails. This is due to a lack of uniformity and we argue 
that the direct application of the robustness measures listed above do not provide an accurate description of 
the behavior of EDAP estimates. Instead, robustness should be measured by the properties of the pointwise 
limit of a-level influence functions. 

As noted in the introduction, robustness to outliers is treated under title of "outlier rejection" in Bayesian 
analysis and generally corresponds to a breakdown point of 1. As we show below, the analysis of robustness 
we propose also reconciles Bayesian and other regularized estimates with the traditional description of a 
robust estimator as having breakdown point of 1/2. A Bayesian analysis of outlier rejection for our methods 
can be undertaken using the analysis techniques developed here; it is omitted for the sake of brevity. 
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To describe robustness, we view our estimates as functionals Tn{h) mapping the space of densities to 
In particular, we examine the EDAP estimate 



Tnih) 



TT{e)de 



(11) 



and note that in contrast to classical approaches to analyzing robustness the interaction between the disparity 
and the prior requires us to make the dependence of r„ on n explicit. This dependence is shared by any 
estimator that incorporates priors - including all classical Bayesian methods - and affects the traditional 
measures of robustness as examined below. Note that here, T„ is taken to be a deterministic sequence of 
maps for the space of densities to 0. 

We analyze the behavior of T„{h) under the sequence of perturbations hz.a{x) = (1 — oi)g{x) + atz{x) 
for a sequence of densities tz{-) and < a < 1. Here we assume that tz{-) is a contaminating sequence 
defined so that it becomes orthogonal to both g and the parametric family for large z. Note that unlike our 
examination of efficiency below, in these analyzes we do not require that g belongs to the parametric family; 
thus ha.z describes the effect of adding outliers to a fixed kernel density estimate. We also assume that fg 
and g become orthogonal at large values of 9. 



lim / tz{x)g{x)dx — 



lim / tz{x)fg{x)dx 



0, ve* e e 



lim 



sup 

\e\\>e* 



g{x)fg{x)dx = 0. 



(12) 
(13) 



(14) 



Typically, tz{-) is taken to be a uniform distribution on a small neighborhood centered at z; but these 



conditions are clearly more general. They extend those given in Park and Basu (2004) in not requiring g 
to be a member of the parametric family fg. The a- level influence function is then defined analogously to 



Beran (1977) by 



IF,,„(z) = a-i [Tnihz^^) - T,{h)] 
where we again note that the dependence of IFq^„(z) on n is induced by the prior. 



(15) 



( 15 ) represents a complete description of the behavior of our estimator in the presence of contamination, 



up to the shape of the contaminating density. While for EDAP estimators it contains an explicit dependence 
on n, we begin by observing its limit for large n. Firstly, as with more classical Bayesian estimates, EDAP 
estimators approach their frequentist counterparts at a rate. 

Theorem 1. Assume that G has Jour continuous derivatives, that fg is four times continuously differentiahle 
in 9 and that the third derivatives of ti are bounded. Define T„ {h) as in (11) and the minimum disparity 
estimator (MDE) as 

9{h) ^nigmm D{hje) (16) 
see 

then Tn(h) — 9{h) = Op{n~^). 



As an immediate corollary, the a-level influence functions converge at the same rate 
Corollary 1. Under the conditions of Theorem^ define 

then for every a and z, 



(17) 
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That is, the influence function for 6{-) represents a reasonable description of the behavior of T„(-). In 
particular, in the case of Hellinger distance methods. Theorems 5 and 6 in Beran (1977) have direct analogues 
for EDAP and MDAP estimators respectively. 

While Corollary [l] motivates using the properties of IFq^oo(-z) to describe the robustness of of T'„(-), we 
note that the convergence in (17) need not be uniform in z. The classical summaries of robustness properties 
investigated below are focussed on extremal values of IFq,_„(0); the breakdown point at large z and the 



classical influence function at small a. The lack of uniformity in (17) means that these summaries when 
applied at finite values of n need not reflect the asymptotic properties as described in IFq^oo(^)- We explore 
this discrepancy below and argue that the asymptotic influence measure is more appropriate in the sense 
of representing a minimax approach to robustness. The arguments employed here are broadly applicable to 
robust estimators that depend on n; in particular, regularized versions of robust estimators are susceptible 
to the same discrepancies and our analysis provides a framework for describing robustness in this context as 
well. 



4.1 Breakdown Point 



We begin by motivating a version of the breakdown point for n-dependent estimators. Classically, the 
breakdown point is defined to be 



(18) 



B{Tn) = sup < a ■■ sup |IFq,„(z)| < oo 



(see Huber (1981)). [Beran (1977) and [Park and Basu (2004) demonstrated that the MDE has a breakdown 



point of 1/2 in the case of Hellinger distance and when G{-) and G'(-) are bounded respectively. In contrast, 
we show in Theorem [s] and Corollary [2] below that for each fixed n, B{Tn) = 1. The distinction between 
these cases motivates an alternative measure that captures the intuition of the classical breakdown point 
when applied to a sequence of estimators that changes over n. We define the asymptotic breakdown point of 
the sequence {Tn}^^i, as follows 



B* ({Tjc}) — sup < a : limsupsup |IFq,.„(z)| < oo 

71 Z 



(19) 



That is, for each n we consider the maximal displacement under a-level contamination and declare a break- 
down if the limit of these displacements is unbounded. 

It is easy to see that MDAP estimators have asymptotic breakdown 1/2 if their MDE counterparts do 
and log7r(6') convex. Writing the MDAP estimator as 



f„(/i) ^ aigmax{nD{h,f0) -logTr{d)) 
e<£0 



(20) 



it is readily seen that if 6* maximizes 7t{9) then |T„(ft,) — 9* \ < \0{h) — 9* \ for 9{h) sufficiently large and thus 
if IFq,^oo(^) is uniformly bounded, so is the influence function of r„. Nonetheless, the convergence of Tn{h) 
to 0{h) means that if 9{ha,zk) ^ cxd there is a sequence so that Tn^{ha^zk) — >■ oo as fc — )■ oo. 

For EDAP estimators a uniform identifiability condition is required. We also impose boundedness on G 
and G", but note these conditions do not hold for Hellinger distance, however, a direct argument can be given 
and hence in Theorem [2] below and in other results we will sometimes state that "The results also hold for 
Hellinger Distance" . 



Theorem 2. Under the conditions of Theorem I] additionally assume G{ ) and G' {■ 
J ||6'||27r(6')(i0 < oo, then under conditions ( T^p^ ) if, 



inf inf D{at^Jg) > inf ^((l - a)g, fg) + 5, 
2 eee see 



hounded and let 



(21) 
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for a < 1/2, the asymptotic breakdown point of Ti,T2, . . . is equal to the breakdown point of 9{h). The result 
continues to hold when D is given by Hellinger distance. 

The identifiabihty condition imposed above ensures that tz does not more closely resemble the family fg 



than g. In the analysis of Park and Basu (2004), g = fg is assumed; under these conditions i?*({T„}) = 1/2. 



If we replace g by the estimate it could happen that the inequality in ( 21 ) is reversed (ie for every z there 
is a fg_, closer to tz than any member of fg is to gn). In this case, the breakdown point of 0{h) could be 
strictly less than 1/2. 



These results are in contrast to the treatment of T„ for fixed n. Here we follow Beran (1977) in evaluating 
\imz^ooTn{ha.z) for each z. 

Theorem 3. Under the contions of Theorem^ 

lim Tn{ha,z) = Tn{{^ - a)g). 
This result also holds for Hellinger distance. 



The condition that D{g,fg) be bounded holds if |G(-)| is bounded; this is assumed in Park and Basu 



(20041 and holds for the negative exponential disparity and Hellinger distance (0 < 2HD{g, fg) < 4). 



For MDE's, taking g = fg^ yields T'„((l — a)fea) = ^'o■ For EDAP estimators, the (1 — a) factor generally 
results in a reduction in strength in the disparity relative to the prior. For Hellinger distance 

n2HD{{l - a)g, fg)=A- 4^1^ / V9ix)f9ix)dx = nVT^2HD{g, f) - 4(1 - VT^) 



since the second term in canceled in normalizing the D-posterior this is equivalent to reducing n by a factor 

We note here that while the above discussion examines the behavior of T„{ha_z) for small a, it can readily 
be extended to the following corollary 

Corollary 2. Let D{g,fg) be bounded for all 9 and all densities g and let J ||^||27r(0)(i6' < oo, then the 
breakdown point of the EDAP is 1. 

A simple direct proof is given for this in Online Appendix [A| We observe that D{0,fg) — G(— 1) is 
independent 016*, yielding T„(0) = J 0T:{e)d9: the prior mean. 

The results at fixed n indicate an extreme form of robustness that results from the fact that the disparity 
approximation to the likelihood is weak in its tails. This produces a lack of equivariance in the resulting 
estimator that appears in the third term of the asymptotic expansion as shown in Equation [34] of the online 
appendix. The fixed n result does not distinguish our estimator from alternative estimators that are clearly 
problematic. In particular, the threshold estimator of the mean defined by 



JxdF{x) \ f xdF{x)\ < lOOOn 

lOOOn * sign (J xdF{x)^ otherwise 



is also efficient and has breakdown point 1. However, by considering contamination with tz-^{x) taken to be 
uniform on [lOOOn — 1, lOOOn] it is readily seen that the asymptotic breakdown point is i3*({TO„}) = 0. We 
might also contemplate a mean estimate based on a penalized Huber loss: 



hnif) = argminn / H{x — ii)dF{x) + A„/i 
/i J 
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where H is the Huber loss function (see Huber (1981)) and A„ — >■ 0. 
sufficiently large, the breakdown point of h{f) is also 1, but B*{{hn}) 
will not be efficient. 



Since Ii{z) increases linearly for z 
= 1/2. Of course in this case, /i„(/) 



While we have suggested that the distinction between the finite n and asymptotic breakdown point is a 



reflection more on the definition ( 18 1 than the properties of EDAP, it does leave considerable room for the 



design of unbounded disparities that are nonetheless robust and which would therefore also allow the use of 
improper priors while still obtaining proper D-a posteriori distributions. 



4.2 Influence Function 



An alternative measure of robustness is given by the influence function (Hampel 1974): 

IFo,n(2) = lim IFa^„(z) 



(22) 



That this function need not always provide a useful guide to the behavior of Tn{ha,z) was observed in Beran 



(19771 and further expanded in Lindsay (1994) who demonstrated that all MDE's that yield efhciency share 



the same influence function as the MLE whatever their behavior at gross levels of contamination. The analysis 
in Lindsay (19941 implicitly assumes an equivariant estimator so that T„(/e) = 6 for any 6. When the effect 



of a prior is included in the analysis a different result is obtained at finite samples, but an equivalent limiting 
result can be derived. 

To examine the influence function for EDAP estimators, we assume that the limit may be taken inside 

Epo{e\g)Cz{d,g) 



E 



all integrals in ( 22 ) and obtain 

= nCovp^(eig) {e,Czi0,g)) ■ 
where Ep^(^g\g^ indicates expectation with respect to the D-posterior with fixed density g and 



1 fe{x)dx 



= / G' 



e=0 

- 1 ) {t^{x) - g{x))dx. 



Je{x) 

Here we observe that IFo^nX^) depends on the prior vr. This is the case for any a posteriori estimate. IFo,n(2;) 
also depends on the disparity employed as demonstrated in 

Theorem 4. Let D(g, fg) he hounded and assume that 



Co = sup 



G' 



fe{x) 



1 Trie) 



d9 < oo and e\ — sup 



dd < (X) 



(23) 



then \IF{d;g,tz)\ < oo. 



In the case of Hellinger distance the conditions of Theorem|4]require the boundedness of r{x) = / {\/ fe{x) / \/ g{x))7T{9)d9 
which may not always hold (e.g. Beran 19771. 



Despite this strong result, in an asymptotic sense the choice of disparity and prior is indestiguishable 
when g is assumed to lie within the model class. Expanding Cz{0,g) about 6 = Ep^^g^g-j9 provides 



IFo,„(z) = nEp^(^e\g) {& - 



iD{eg)-'c'M + {o~eg)c'A9+) + o{n-^/^) 
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where 6* is lies between 6 and 6 and 6*+ between 9 and 6'„, since (6* - 6*3) and - Og) are Op(n-i/2) and 
o(n~^/^) respectively. We now observe that when g = fg^ is in the model class, 



- g{x))dx 



is independent of both the disparity and the prior, as is Id{&o) and the limiting value of lFo^n{z) coincides 
with the influence function of the MLE. We also note that the next leading term in the expansion above is 
C"(0), which co-incides with the second-order approximation in Lindsay (1994 Eqn. 7). Unlike the case of 



the MDE, however, here the second order term does affect the influence function at finite n. 



5 Efficiency and Numerical Results 

While there is a large literature on robust estimation methods, disparity-based estimation methods also 
achieve statistical efficiency when g is a member of the parametric family fg. In this section, we present 
theoretical results for i.i.d. data to demonstrate that inference based on the D-posterior is also asympot- 
ically efficient. We also conduct a simulation study to demonstrate the finite-sample performance of these 
estimators. 



5.1 Efficiency 

We recall that under suitable regularity conditions, expected a posteriori estimators are strongly con- 



sistent, asymptotically normal and are statistically efficient; (see Ghosh et al. 2006, Theorems 4.2-4.3). 
Our results in this section show that this property continues to hold for EDAP estimators under regularity 
conditions on G{-) when the model {fg : 9 & Q} contains the true distribution. We define 

= VlD{gJg), and 1^(9) = VlDig^^Jg) 

as the disparity information and 9g the parameter that minimizes D{g, fg) (note that 9g here depends on g). 
We note that if <? = fg^, I^{9g) is exactly equal to the Fisher information for 9g. 

The proofs of our asymptotic results rely on the assumptions listed below. Among these are that minimum 
disparity estimators are strongly consistent and efficient; this in turn relies on further assumptions, some of 
which make those listed below redundant. They are given here to maximize the mathematical clarity of 
our arguments. We assume that Xi, . . . , X„ are i.i.d. generated from some distribution g(x) and that a 
parametric family, fg{x) has been proposed for g{x) where 9 has distribution tt. To demonstrate efficiency, 
we assume 

(Al) g{x) — fg^(x); i.e. is a member of the parametric family. 

(A2) G has three continuous derivatives with G'(0) = 0, G"(0) = 1 and |G"'(0)| < 00. 
(A3) There exists C > such that for all g and h 



sup \DigJg)^DihJg)\<C f 
see J 



g(x) — h{x)\dx. 



(A4) VgZ3((7, fg) is positive definite and continuous in 9 at 9g and continuous in g with respect to the Li 
metric. 
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(A5) For any S > 0, there exists e > such that 



sup {D{gJg)-D{gJgJ)>e 
\9-e,\>s 



(A6) The minimum disparity estimator, 6'„, satisfies 0„ 9g almost surely and ^/n{0n—0g) N{0, 1 {0) ) 



Our first result concerns the limit distribution for the posterior density of y'n^O — 9„) , which demonstrates 

that the D-posterior converges in Li to a Gaussian density centered on the minimum disparity estimator 0„ 

-1 



with variance 



This establishes that credible intervals based on either P]j{9\xi, . . . , a;„) or from 



^n) ^) will be asymptotically accurate. 



Theorem 5. Let 0„ be the minimum disparity estimator of 9g, 7r(0) he any prior that is continuous and 
positive at 9g with Jq ||0||27r(0)(i^ < oo where || • II2 is the usual 2-norm, and tt^^ {t) be the D-posterior density 
of t = {ti,. . . ,tp) = y/n{9 — 9n)- Then, under conditions 

p/2 



lim 



\t)- 




dt 0. 



(24) 



Furthermore, (24 1 also holds with I (9g) replaced with Ini^n) 



Our next theorem is concerned with the efficiency and asymptotic normality of EDAP estimates. 

Theorem 6. Assume conditions (.^^-(^^ and Jq \\9\\2'!r{9)d9 < 00, then 

^ where 0*„ is the EDAP estimate. Further, (61* - 9g) N {0,I°{9g)). 

The proofs of these theorems are deferred to the online appendix [BJ but the following remarks concerning 
the assumptions (Al)-(A6) are in order: 

1. Assumption A[l] states that g is a member of the parametric family. When this does n ot hold, a cent ral 



limit theorem can be derived for 0„ but the variance takes a sandwich-type form; see 



the case of Hellinger distance. For brevity, we have followed Basu et al. (1997) and Park and Basu 



Beran 



(19771 in 



( 2004 ) in restricting to the parametric case. 



2. 



Assumptions jA[2]-A[5] are required for the regularity and identifiability of the parametric family fg in 
the disparity D. Note that Als] holds for Hellinger distance and if G"(-) is bounded from arguments in 



Park and Basu (2004); other disparities may require specialized demonstrations. Specific conditions for 



^6 to hold are given in various forms in Beran (1977); Basu et al. (^1997); Park and Basu (2004) and 



Cheng and Vidyashankar (2006), see conditions in Online Appendix 



(11997 



3. The proofs of these results employ the same strategies as those for posterior asymptotic efficiency (see 



Ghosh et al. 2006 for example). However, here we rely on the second-order convergence of the disparity 



4. 



to the likelihood at appropriate rates and the consequent asymptotic efficiency of minimum-disparity 
estimators, which in turn is based on a careful analysis of non-parametric density estimates. 

Since the structure of the proof only requires second-order properties and appropriate rates of con- 
vergence, we can replace D{gn, fg) for i.i.d. data with an appropriate disparity-based term for more 
complex models as long as A[6]can be shown hold. In particular, the results in [Hooker and Vidyashankar] 
(2010b I suggest that the disparity methods for regression problems detailed in Section [6] will also yield 
efficient estimates. 
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5.2 Simulation Studies 



To illustrate the small sample performance of D-posteriors, we undertook a simulation study for i.i.d. 
data from Gaussian distribution. 1,000 sample data sets of size 20 from a A^(5, 1) population were generated. 
For each sample data set, a random walk Metropolis algorithm was run for 20,000 steps using a A^(0,0.5) 
proposal distribution and a N{0, 25) prior, placing the true mean one prior standard deviation above the 



prior mean. The kernel bandwidth was selected by the bandwidth selection in Sheather and Jones ( 1991 ) 



H- and N-posteriors were easily calculated by combining the KernSmooth (original by Matt Wand. R port by 



Brian Ripley. 2009 ) and LearnBayes (Albert 2008 ) packages in R. We also report an experiment in which the 
normal log likelihood is replaced in the posterior with Tukey's biweight objective function using a cut-point 
of 4.685 as a comparison to alternative robust estimators. In order to compare computational cost, we have 
run an MCMC chain for the normal log likelihood and report these below, even though analytic posteriors 
are available. 

Expected a posteriori estimates for the sample mean were obtained along with 95% credible intervals 
from every second sample in the second half of the MCMC chain. Outlier contamination was investigated by 
reducing the last one, two or five elements in the data set by 3, 5 or 10. This choice was made so that both 
outliers and prior influence the EDAP in the same direction. The analytic posterior without the outliers is 
normal with mean 4.99 (equivalently, bias of -0.01) and standard deviation 0.223. 

The results of this simulation are summarized in Tables [l] (uncontaminated data) and [2] (contaminated 
data). As can be expected, the standard Bayesian posterior suffers from sensitivity to large negative values 
whereas the disparity-based methods remain nearly unchanged. Tukey's biweight also ignored large outliers, 
but was more sensitive than the disparity methods to larger amounts of contamination. Near-outliers at 
the smaller value of -3 resulted in similar biases across all methods. We observe that all robust estimates 
have slightly larger standard deviations than the EAP corresponding to a loss of efficiency of 2% for the 
H-posterior and 5% for the N-posterior and Tukey estimates. We speculate the increased variance from the 
N-posterior is due to its relatively Heavier tails (the maximal value of NED is compared to 4 for 2HD). 
A comparison of CPU time indicates that the use of disparity methods required a little more than twice the 
computational effort as compared to using the likelihood within an MCMC method. Further details from 
this simulation, including comparisons with Huber estimators are given in Online Appendix |E.1[ 



The influence of the prior is investigated in the right-hand plot of Figure [T] where we observe that the 
EAP and EDAP estimates are essentially identical until the prior is about 9 standard deviations from the 
mean of the data: at this point the prior dominates. However, we note that this picture will depend strongly 
on the prior chosen; a less informative prior will have a smaller range of dominance. 

Because the normal distribution is symmetric, estimating its mean is relatively easy. We therefore also 
conducted a simulation to estimate both shape and scale parameters in an exponential-Gamma distribution 
(i.e. exp(A"i) has a Gamma distribution). The details and results of this simulation are reserved to Online 
Appendix |E.l| We observed the expected behavior: EDAP estimates remained insensitive to outliers, whereas 
they significantly distorted the EAP. However in this case, the H-posterior demonstrated larger variance 
than the N-posterior which we explain as being due to the tendency of nonparametric density estimates from 
exponential-Gamma data to become bimodal: producing inliers where a large value of the parametric density 
is compared to a relatively small value of the nonparametric estimate. 



6 Disparities based on Conditional Density for Regression Models 

The discussion above, along with most of the literature on disparity estimation, has focussed on i.i.d. 
data in which a kernel density estimate may be calculated. The restriction to i.i.d. contexts severely limits 
the applicability of disparity-based methods. We extend these methods to non-i.i.d. data settings via the 
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Table 1: A simulation study for a normal mean using the usual posterior, the Hellinger posterior and the 
Negative Exponential posterior. Columns give the bias and variance of the posterior mean, coverage and 
average CPU time of the central 95% credible interval based on 1,000 simulations. These are recorded for 
the posterior, Hellinger distance (HD), negative exponential disparity (NED) and Tukey's biweight objective 
used in place of the log likelihood (Biweight). 





Bias SD Coverage Length CPU Time 


Posterior 
Hellinger 

Negative Exponential 
Biweight 


-0.015 0.222 0.956 0.873 3.393 
-0.015 0.225 0.954 0.920 7.669 
-0.018 0.229 0.973 1.022 7.731 
-0.017 0.228 0.977 1.007 3.523 



Table 2: Results for contaminating the data sets used in Table [T] with outliers. 1, 2, and 5 outliers (large 
columns) are added at locations -3, -5 and -10 (column Loc) for the posterior, Hellinger distance (HD), 
negative exponential disparity (NED) and Tukey's biweight objective used in place of the log likelihood 
(Biweight) . 







1 Outher 




2 Outliers 




5 Outlie 


rs 


Loc 


Bias 


SD 


Coverage 


Bias 


SD 


Coverage 


Bias 


SD 


Coverage 


Posterior 




















-3 


-0.164 


0.219 


0.883 


-0.300 


0.206 


0.722 


-0.637 


0.182 


0.100 


-5 


-0.264 


0.219 


0.778 


-0.490 


0.206 


0.375 


-1.053 


0.182 


0.001 


-10 


-0.513 


0.219 


0.360 


-0.965 


0.207 


0.004 


-2.093 


0.182 


0.000 


HD 




















-3 


-0.109 


0.246 


0.920 


-0.194 


0.275 


0.859 


-0.237 


0.299 


0.770 


-5 


-0.027 


0.238 


0.942 


-0.040 


0.257 


0.928 


-0.024 


0.305 


0.865 


-10 


-0.014 


0.234 


0.948 


-0.019 


0.249 


0.935 


0.018 


0.286 


0.883 


NED 




















-3 


-0.080 


0.256 


0.959 


-0.133 


0.279 


0.933 


-0.166 


0.308 


0.893 


-5 


-0.020 


0.238 


0.977 


-0.025 


0.243 


0.968 


-0.015 


0.264 


0.948 


-10 


-0.017 


0.237 


0.973 


-0.020 


0.241 


0.970 


-0.007 


0.260 


0.952 


Biweight 




















-3 


-0.091 


0.246 


0.954 


-0.175 


0.252 


0.915 


-0.443 


0.275 


0.645 


-5 


-0.018 


0.237 


0.974 


-0.019 


0.236 


0.972 


-0.022 


0.243 


0.967 


-10 


-0.017 


0.236 


0.977 


-0.018 


0.234 


0.971 


-0.018 


0.236 


0.969 
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use of conditional density estimates. This extension is studied in the frequentist context in the case of 



minimum-disparity estimates for parameters in non-Hnear regression in Hooker and Vidyashankar (2010a 



Consider the classical regression framework with data {Yi,Xi), . . . , (y„, Xn) is a collection of i.i.d. random 
variables where inference is made conditionally on Xi. For continuous Xi, a non-parametric estimate of the 
conditional density of y\x is given by Hansen (2004) and Li and Racine] ( 2007[ ): 



nc„2 Z-ii=l y c„2 J 



(25) 



Under a parametric model fe{y\Xi) assumed for the conditional distribution of Yi given Xi, we define a 
disparity between gn'^ and fg as follows: 



n 



i=l 



(26) 



As before, for Bayesian inference we replace the log likelihood by negative of the conditional disparity (26); 
that is, 

e'(^^-^)7r(0)«e-^*^'(f"''/«)7r(0). 

In the case of simple linear regression, Yi = [3^ + fiiXi + ei, 9 ~ (/3o, cr^) and fg{-\Xi) ~ 4>i3o+PiXi,a^{') 
where 4'ii,a^ is Gaussian density with mean fi and variance a^. 

The use of a conditional formulation, involving a density estimate over a multidimensional space, produces 
an asymptotic bias in MDAP and EDAP estimates similar to that found in Tamura and Boos (1986), who 
also note that this bias is generally small. Online Appendix [C] proposes two alternative formulations that 
reduce the dimension of the density estimate and the bias. 



When the Xi are discrete, (25 1 reduces to a distinct conditional density for each level of Xi. For example, 

, N, this reduces to 



in a one-way ANOVA model Yij = Xi + eij, .? = 1, 



1 



K 



We note that in this case the bias noted above does not appear. However When the rii are small, or for high- 
dimensional covariate spaces the non-parametric estimate gn{y\Xi) can become inaccurate. The marginal 
methods discussed in Online Appendix [C] can also be employed in this case. 



Online Appendix |E.2| gives details of a simulation study of this method as well as those described in 
Online Appendix [C] for a regression problem with a three-dimensional covariate. All disparity-based methods 
perform similarly to using the posterior with the exception of the conditional form in Section[6]when Hellinger 
distance is used which demonstrates a substantial increase in variance. We speculate that this is due to the 
sparsity of the data in high dimensions creating inliers; negative exponential disparity is less sensitive to this 
problem (Basu et al. 1997). 



7 Disparity Metrics and the Plug-In Procedure 

The disparity-based techniques developed above can be extended to hierarchical models. In particular, 
consider the following structure for an observed data vector Y along with an unobserved latent effect vector 
Z of length n: 

P{Y, Z,9) = Pi{Y\Z,0)P2{Z\0)P3{9) (27) 
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where Pi , P2 and P3 are the conditional distributions of Y given Z and 9 the distribution of Z given 6 and 
the prior distribution of 0. Any term in this factorization that can be expressed as the product of densities of 
i.i.d. random variables can now be replaced by a suitably chosen disparity. This creates a plug-in procedure 
in which particular terms of a complete data log likelihood are replaced by disparities. For example, if the 
middle term is assumed to be a product: 



Piz\9) = l[p{z,\e), 



inference can be robustified for the distribution of the Zi by replacing ( 27 ) with 

PDAY,Z,e) = P(y|Z,6l)e-2-D(9"(-;^).-P2(-|e))pg(g)) 

where 

1 " 
g„{z;Z) = 

In an MCMC scheme, the Zi will be imputed at each iteration and the estimate gn{'] Z) will change accord- 
ingly. If the integral is evaluated using Monte Carlo samples from g„, these will also need to be updated. 
The evaluation of D{gn{-\ Z),P2{-\0)) creates additional computational overhead, but we have found this 
to remain feasible for moderate n. A similar substitution may also be made for the first term using the 
conditional approach suggested above. 

To illustrate this principle in a concrete example, consider a one-way random-effects model: 

Fij = Zj -I- e.y , i = 1, . . . , n, j = 1, . . . , n,; 

under the assumptions 

-iV(0,a2), Z,^N{^ly) 

where the interest is in the value of /i. Let 7r(/i, tr^, r^) be the prior for the parameters in the model; an 
MCMC scheme may be conducted with respect to the probability distribution 

n I rii \ ^ 

P{Y,Z,fi,<j^,T^)=Y[ m0o,.^(y,,-Z,) []0,,,,2(Z,)7r(^,fT2,r2) (28) 
i=i \]=i J i=i 

where is the N{^,a'^) density. There are now two potential sources of distributional errors: either in 

individual observed Yij, or in the unobserved Z^. Either (or both) possibilities can be dealt with via the 
plug-in procedure described above. 

If there are concerns that the distributional assumptions on the eij are not correct, we observe that the 
statistics — Zi are assumed to be i.i.d. N{0,(j'^). We may then form the conditional kernel density 
estimate: 



7=1 



and replace ( 28 1 with 

n 

Pn, (F, Z, ^, a^ r^) = e- rnDi,i^Ht\z.,zUo.M-)) J| (z,)7r(^, a'.r^). (29) 



i=l 



On the other hand, if the distribution of the Zi is miss-specified, we form the estimate 

gn{z;Z)^^±K('-^) 
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and use 

n / 7ii \ 

«=i \j=i / 

as the D-posterior. For inference using this posterior, both fi and the Zi will be included as parameters in 
every iteration, necessitating the update of Z) or gn\-\z; Z). Naturally, it is also possible to substitute 
a disparity in both places: 

A simulation study considering all these approaches with Hellinger distance chosen as the disparity is 
described in Online Appendix |E.3[ Our results indicate that all replacements with disparities perform well, 
although some additional bias is observed in the estimation of variance parameters which we speculate to be 
due to the interaction of the small sample size with the kernel bandwidth. Methods that replace the random 
effect likelihood with a disparity remain largely unaffected by the addition of an outlying random effect while 
for those that do not the estimation of both the random effect mean and variance is substantially biased. 

While a formal analysis of this method is beyond the scope of this paper we remark that the use of density 
estimates of latent variables requires significant theoretical development in both Bayesian and frequentist 
contexts. In particular, in the context of using Pq^ appropriate inference on 9 will require local agreement 
in the integrated likelihoods 



i=i \j=i 



P n I rii \ n 

j - zM J]0^,,2(zodZi,...,dz„. 



i—l \ j — 1 / « — 1 

This can be demonstrated if the rij —> 00 and hence the conditional variance of the Zi is made to shrink at 
an appropriate rate. 

We note here that the Bayesian methods developed in this paper are particularly relevant in allowing the 
use of MCMC for these problems. A frequentist analysis could be obtained by marginalizing over the Zi 



in (291, (30 1, or (31 1. However, this marginalization is numerically challenging while it can be very readily 



obtained in a Bayesian context via MCMC methods. 



8 Real Data Examples 
8.1 Parasite Data 

We begin with a one-way random effect model for binomial data. These data come from one equine farm 
participating in a parasite control study in Denmark in 2008. Fecal counts of eggs of the Equine Strongyle 
parasites were taken pre- and post- treatment with the drug Pyrantol; the full study is presented in |Nielseii| 
et al. (2010). The data used in this example are reported in Online Appendix [F] 



For our purposes, we model the post-treatment data from each horse as binomial with probabilities drawn 
from a logit normal distribution. Specifically, we consider the following model: 

ki ^ ^\n(Ni,pi), logit(pi) N{p., cr^), i = l,...,n, 
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Figure 2: Posterior distributions for the parasite data. Left: posteriors for /i with and without an outher 
and the N-posterior. Middle: posteriors for a. Right: samples of gn based on draws from the posterior for 
Ai, . . . , A„, demonstrating an outlier at -4. 



where Ni are the pre-treatment egg counts and ki are the post-treatment egg counts. We observe the data 
(fci, Ni) and desire an estimate of /i and a. The likelihood for these data are 



l{ii,<7\k,N) = -^[kdogp^ + {N,-h)\og{l-p,)\ - ^^(log(p. 



i=l 



We cannot use conditional disparity methods to account for outlying ki since we have only one observation per 
horse. However, we can consider robustifying the pi distribution by use of a negative exponential disparity: 



5„(A;pi,...,p„) = — 



A - logit(p,;) 



l'^{fi,a\k,N) = [ki\ogp^ + (iV, - fc,) log(l - p^)] - nD{gn{-;pi, . . . 0/.,<t2 (■)) 

In order to perform a Bayesian analysis, /x was given a A''(0, 5) prior and an inverse Gamma prior with shape 
parameter 3 and scale parameter 0.5. These were chosen as conjugates to the assumed Gaussian distribution 
and are defuse enough to be relatively uninformative while providing reasonable density at the maximum 
likelihood estimates. A random walk Metropolis algorithm was run for this scheme with parameterization 
(/i, log((T), logit(pi), . . . ,logit(p„)) for 200,000 steps with posterior samples collected every 100 steps in the 
second half of the chain. c„ was chosen via the method in Sheather and Jones (1991) treating the empirical 
probabilities as data. 

The resulting posterior distributions, given in Figure [2] indicate a substantial difference between the 
two posteriors, with the N-posterior having higher mean and smaller variance. This suggests some outlier 
contamination and a plot of a sample of densities gn on the right of Figure |2] suggests a lower-outlier with 
logit(pi) around -4. In fact, this corresponds to observation 5 which had unusually high efficacy in this horse. 
Removing the outlier results in good agreement between the posterior and the N-posterior. We note that, as 



also observed in Stigler (1973), trimming observations in this manner, unless done carefully, may not yield 



accurate credible intervals. 



8.2 Class Survey Data 

Our second data set are from an in-class survey in an introductory statistics course held at GorncU 
University in 2009. Students were asked to specify their expected income at ages 35, 45, 55 and 65. Responses 
from 10 American-born and 10 foreign-born students in the class are used as data in this example; the data 
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are presented and plotted in Online Appendix |F] Our object is to examine the expected rate of increase in 
income and any differences in this rate or in the over-all salary level between American and foreign students. 
From the plot of these data in Figure [7] in Online Appendix |F] some potential outliers in both over-all level 
of expected income and in specific deviations from income trend are evident. 

This framework leads to a longitudinal data model. We begin with a random intercept model 

Yijk = boij + bijtk + €ijk (32) 

where Yijk is log income for the ith student in group j (American (a) or foreign (/)) at age tfc. We extend 
to this the distributional assumptions 

leading to a complete data log likelihood given up to a constant by 

l{Y,l3,<j^T^) = ~Yl E E2a^(^^^-fc-^o.,-/3i,ife)'-E E ^{bo^,-f3oJT (33) 
i=i je{o,/}fc=i ^=iie{aj} 

to which we attach Gaussian priors centered at zero with standard deviations 150 and 0.5 for the /3oj and 
f3ij respectively and Gamma priors with shape parameter 3 and scale 0.5 and 0.05 for Tq and a^. These are 
chosen to correspond to the approximate orders of magnitude observed in the maximum likelihood estimates 
of the boij, Pij and residuals. 

As in Section [7] we can robustify this likelihood in two different ways: either against the distributional 
assumptions on the e^j^ or on the boij. In the latter case we form the density estimate 



.(^;/3) 



-2nc„ ^-^ ^ 

i=lje{a,/} 



b - boij + I3qj 



and replace the second term in (331 with ~2nD{gn{-; (3), 0q^2(-)). Here we have used 

^ — {PaQ, Pfa, Pal, Pfl, boal, • ■ • i ^Oan, ^0/n) 

as an argument to g„ to indicate its dependence on the estimated parameters. We have chosen to combine the 
boai and the 6o/i together in order to obtain the best estimate of gn, rather than using a sum of disparities, 
one for American and one for foreign students. 

To robustify the residual distribution, we observe that we cannot replace the first term with a single 
disparity based on the density of the combined Cijk since the boij cannot be identified marginally. Instead, 
we estimate a density at each ij: 

4 



and replace the first term with — X]"=i J2j&{a /} id if, ni' P)^ ^o,a^{'))- This is the conditional form of the 
disparity. Note that this reduces us to four points for each density estimate; the limit of what could reasonably 
be employed. Naturally, both replacements can be made. 

Throughout our analysis, we used Hellinger distance as a disparity; we also centered the tk, resulting in 
boij representing the expected salary of student ij at age 50. Bandwidths were fixed within a Metropolis 
sampling procedures. These were chosen by estimating the 6oij and /3ij via least squares, and using these to 
estimate residuals and all other parameters: 

Poj — ~^0i, Cijk — Yijk — boij — Pljtk, 

- E -0 = ^ E(^o., - /3o,f . 

zjk ij 
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The bandwidth selector in Sheather and Jones ( 1991 1 was applied to the &oij ~ ^oj to obtain a bandwidth for 
gn{b; (3). The bandwidth for g\j^^^{e; /3) was chosen as the average of the bandwidths selected for the Cijk for 
each i and j. For each analysis, a Metropolis algorithm was run for 200,000 steps and every 100th sample was 
taken from the second half of the resulting Markov chain. The results of this analysis can be seen in Figure 
[3[ Here we have plotted only the differences /3/o ~ l^ao and /3/i — Pai along with the variance components. 
We observe that for posteriors that have not robustified the random effect distribution, there appears to be a 
significant difference in the rate of increase in income (P(/3/i < Pai) < 0.02 for both posterior and replacing 
the observation likelihood with Hellinger distance), however when the random effect likelihood is replaced 
with Hellinger distance, the difference is no longer significant {P{/3fi < /3ai) > 0.145 in both cases). We also 
observe that the estimated observation variance for the model is significantly reduced for posteriors in which 
the observation likelihood is replaced by Hellinger distance, but that uncertainty in the difference /3/o — /3aO 
is increased. 

Investigating these differences, there were two foreign students who's over-all expected rate of increase 
is negative and separated from the least-squares slopes for all the other students. Removing these students 
increased the posterior probability of Pai > Pfi to 0.11 and decreased the estimate of a from 0.4 to 0.3. 
Removing the evident high outlier with a considerable departure from trend at age 45 in Figure [7] in Online 
Appendix [F] further reduced the EAP of a to 0.185, in the same range as those obtained from robustifying 
the observation distribution. 



Further model exploration is possible. Online Appendix |F.1| explores the use of a random slope model 
with additional modeling techniques, where a distinction in average slope between American and foreign 
students does not appear significant when the slope distribution is robustified via Hellinger distance. 



9 Conclusions 

This paper combines disparity methods with Bayesian analysis to provide robust and efficient inference 
across a broad spectrum of models. In particular, these methods allow the robustification of any portion of a 
model for which the likelihood may be written as a product of distributions for i.i.d. random variables. This 
can be done without the need to modify either the assumed data-generating distribution or the prior. In 
our experience. Metropolis algorithms developed for the parametric model can be used directly to evaluate 
the D-posterior and generally incur a modest increase in the acceptance rate and computational cost. Our 
use of Metropolis algorithms in this context is deliberately naive in order to demonstrate the immediate 
applicability of our methods in combination with existing computational tools. We expect that a careful 
study of the properties of these methods will yield considerable improvements in both computational and 
sampling efficiency. 

The methods in this paper can be employed as a tool for model diagnostics; differences in results by 
an application of posterior and D-posterior can indicate problematic components of a hierarchical model. 
Further, estimated densities can indicate how the current model may be improved by, for example, the 
addition of mixture components. However, the D-posterior can also be used directly to provide robust 
inference in an automated form. 



Our mathematical results are given solely for i.i.d. data; ideas from Hooker and Vidyashankar (2010a 
can be used to extend these to the regression framework. Our proposal of hierarchical models remains 
under mathematical investigation, but we expect that similar results can be established in this case. The 
methodology can also be applied within a frequentist context to define an alternative marginal likelihood 
for random effects models, although the numerical estimation of such models is likely to be problematic. 
Within this context, the choice of bandwidth c„ can become difficult. We have employed initial least-squares 
estimates above, but robust estimators could also be used instead. Empirically, we have found our results to 
be relatively insensitive to the choice of bandwidth. 
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Figure 3: Analysis of the class survey data using a random intercept model with Hellinger distance replacing 
the observation likelihood, the random effect likelihood or both. Top: differences in intercepts between foreign 
and American students (left) and differences in slopes (right). Bottom: random effect variance (left) and 
observation variance (right). Models robustifying the random effect distribution do not show a significant 
difference in the slope parameters. Those robustifying the observation distribution estimate a significantly 
smaller observation variance. 
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An opportunity for further development of the proposed methodology lies in removing the boundedness 
of many disparities in common use. These yield EDAP estimates with finite-sample breakdown points of 1, 
indicating hypcr-insonsitivity to outliers. Theoretically, some form of boundedness in G has been used within 
proofs of the robustness of minimum disparity estimators. However, transformations of D{gn, fg) can yield 
non-bounded replacements for the log likelihood which retain both robustness and efficiency properties and 
this suggests an investigation of the relationship between appropriate transformations and the structure of 
the parameter space fg. 

The use of a kernel density estimate may also be regarded as inconsistent with a Bayesian context and it 
may be desirable to employ non-parametric Bayesian density estimates as an alternative. Results for disparity 
estimation are dependent on properties of kernel density estimates and this extension will require significant 

mathematical development. 

There is considerable scope to extend these methods to further problems. Robustification of the innovation 
distribution in time-series models, for example, can be readily carried though through disparities and the 
hierarchical approach will extend this to either the observation or the innovation process in state-space 
models. The extension to continuous-time models such as stochastic differential equations, however, remains 
an open and interesting problem. More challenging questions arise in spatial statistics in which dependence 
decays over some domain and where a collection of i.i.d. random variables may not be available. There are 
also open questions in the application of these techniques to non-parametric smoothing, and in functional 
data analysis. 
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A Proofs of Robustness 



A.l Proof of Theorem [T] 

Before giving the proof we remark that it follows the lines of asymptotic expansions for posterior distri- 



butions as outlined in, for example, Ghosh et al. (20061. While we have provided explicit expressions only 



for the first term of the expansion, further terms can be given analytically. 

Proof. Let 6{h) be the MDE for the density h. Using a Taylor expansion of the prior density we have: 

= TT {e{h)^ 1 + n-^'H'hi + ]^n-^t'b2t + Op {ii-^^ 
Also, from the corresponding expansion of the disparity 



nD[h,f. 



le(h)+t/^ 



-nD(h,f. 



ie(h) 



\'i^{e{h))t+'^ 



-1/2 
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^^^titjtktia^^ijki + Op 



ijkl 
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yielding 



TT {9{h) + t/Vn) e-"''('''^«»+*/^/^)+"''('''^«<'')) 



1 + ^7^ + +Op(n ) 



where ci(t) = Ubi-i + g Eii/c Utjtkas.ijk, and 



6 Z^ijTc ''I'^j'^k^i-ijk 



6 



24 



.4 , \ ' "s^jfe ,2 .2 .2 

Z^^f^^^^rk- 



Here 61 and 62 provide constants in the Taylor of the prior and and 04 provide the corresponding third 
and four-order constants in a Taylor expansion of the disparity. 

In particular, for the ith component of the EDAP vector we have that 



n] dt 



27r 



p/2 



+ [i^ih)-% ( E, "'"'^T'^^" - ^^^^ 



7r(e(/i)) 



+ Op (n 1) 



p/2 



2ir 

7--D^J,^-ll 



+ Op{n 1) 



Am) 



0, (n-^) 



Here the boundedness of the third derivatives of tt ensures the integrability of the Op{n) terms. 



(34) 



□ 



A. 2 Lemmas [T] and [2] 

The following lemmas are needed in the proof of Theorem [2] below. 
Lemma 1. Under the conditions of Theorem^ for any 5 there are 9\ > 0, 62 > and z* > such that 



and 



sup sup \D{ha^z,0) - D{atz,ta) - (1 - a)G"(oo)| < S 

\9\>e'z>z* 



sup sup \D{ha.z,0) — -D((l — a)g, ta) — aG'(oo)| < 5. 
\e\<eiz>z' 



(35) 
(36) 



The same results hold for D given by Hellinger distance taking G"(oo) = 0. 



Proof. The arguments employed here largely follow those of Park and Basu ( 2004[ Theorem 4.1). For no- 
tational convenience throughout the following we will use C{g,f) = G{g/ f — 1)/ and in particular we note 
that for / ^ 0, C{g, f) G'{oo)g. We wiU also write G* = sxvpt max(|G(t), \G'{t)\). 



For (35), define 



A(tt = { X : g{x) < max sup tz{x), sup fe{x) 

yzyz' \e\>ei 
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by assumptions (12) and (14) for any e > we can find z* and 9\ large enough that ^ g{x)dx < e and 
sup|g|>g* J^e fe{x)dx < e so that writing 



C{ha^z{x)Je[x))dx 



C{atz{x), fg{x))dx + 



for h*a,z{x) between atz{x) and ha^z{x) yeilds 



sup sup 

z>z' \e\>ei 



C{ha.zix), feix))dx - / C{atz{x), fe{x))dx 



< 2G*e. 



Similarly, on A^, taking a Taylor expansion around f — 0. 

[ C{K.z{x)Je{x))dx~{l~a)G'{(x) 



sup sup 

z>z* \e\>ei 



< 2G*e. 



Choosing e < 6/5G* then yields the result. 



For ( 36 ) we proceed in a similar manner and define 



Bg^,z' ^ {x : sup tz{x) < max g{x), sup fe{x) 
z>z' \ |e|<e| , 



where we observe that by assumptions (12) and (13), for any e > and keeping 62 bounded we can find z* 



large enough that sup^^^* /r , tz{x)dx < e and supigi^g. Jg^ fg(x)dx < e. Reversing the roles of tz and 
g in the argument above then yields the result. 

In the case of Hellinger distance, from the Cauchy- Schwartz inequality we observe that for z* sufficently 
large 



sup sup 

||e||eez>2* 



\Jha^z{x)fe{x)dx - I v/(l - a)g{x)fe{x)dx - / atz{x)fg{x)dx 



< S 



and the result follows from writing Hellinger distance as HD{f,g) = 2 — 2/ V f{x)g{x)dx. 



Lemma 2. For any a < 1/2, if there exists d > such that 

inf inf D(atz,9) > inf ^((l - a)g, 9) + 5, 
z 6»Ge Bee 

we can find z* , such that for every rj < S there exists 9* > with 



inf inf [D{h^,z,9)-D{h^^z,0{h^,z))) >v- 

Proof. From Lemmajl] for any r]* , we can find 9* and z* so that for all z > z* and \9\ > 9* 

D{ha,z,0) > D{atz,9) + (1 - a)G'(oo) - ri* 

and defining 9g = argming^Q D{{1 — a)g,9) 

D{h^^z,Og) < D{{1 - a)g, 9g) + aG'{^) + 77* 

Thus 

inf D{h^,z,e)>Diho,,z,9g)+S-2r^* 
\e\>e' 

and taking rj* — ri/2 yields the result. 



□ 



(37) 



□ 
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A.3 Proof of Theorem H 

Proof. We first observe that if the breakdown point of 9{-) is greater than a, we have 

sup \0{ha,z)\ < OO. 
z 

We observe that for multivariate B it is sufficient to prove the result for each co-ordinate and without loss of 
generality we will take 9{ha^z) > ^ig)- 

Taking 9* and rj and z* as in Lcmmaji] we now observe that we can find e* so that for \9 — 9{ha.z)\ < e* 
for all z > z* such that 

D{h^^,,9) < D{K^,,9{K^,)) + ?//2 

then 

^^nDih^..Mh.,.))^nn J-^^ ^^^^ e^[e)d9 



< mKz)+e)PD{e <9{h^,,)+e) 



<9{K,z) + e-^'>'^K{9{K^,)) 



where K{x) — ^J^"*^^ Tr{9)d9^ J^^9n{9)d9. Since 9{ha^z) is bounded if the breakdown point is greater 
than a, we observe that we can take 

K* = snp{K{9) : 61 < sup 9{K^,)} < oo 

Z>Z* 

and observe that 

IF„,„(z) <IF„,oo(z) + i^*. 
Conversely, if the breakdown point is less than a, we can find z„ such that 

d{ha,zj oo 

and from Theorem [T] for any 6 we can choose a subsequence fc„ such that 

\TkAK.z^) ~ 0{K.z„)\ < 6, Vn. 

□ 

A. 4 Proof of Theorem [3] 

Proof. We observe that for any finite M we have 



sup f0{x)tz{x)dx 

\m<MJ 

by the limiting orthogonality of fg and tz, hence we can find Mk — )■ oo, oo so that 

sup j fe{x)tzk{x)dx ^ 

\\9\\<Mk J 
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and J g{x)tz^{x)dx — > and ^uguyj^.j T^{0)d9 0. Hence 



0\\<Mk 



-nD{ha,^y ,6) 



TT{0)de 



Tr{e)de 



o,(l) 



/||e||<Affc 

where Ofc(l) indicates a limit with respect to A: — J' oo and upon observing that sup/j g Q-nD{h,e) 
Now applying Lemma [T] we have 

D{h^,,, , e) = D{{1 - a)g, 9) + aG'(c^) + Ofe(l) 

Combining these yields 



and the theorem follows. 



= r„((l-a)g)+Ofc(l) 



Ofc(l) 



□ 



A. 5 Proof of Corollary [2] 

Proof. Under the assumptions, supg g D{g, fg) — R < oo and infg g D{g, fg) — r > —oo. Now let hz^a — 
(1 - a)g + atz, then for aU 9 € O, e""-^ < e-"-°('''-.°'^«) < e-"^ Vz, Va G [0 1] and therefore 



7r(6l)d6' 



< 



J9e-W{9)d9 (^_,) 



Je-^^TT{9)d9 ^Po{e\h...r ^ j e-^RTi{9)d9 
Where Et^(-) represents expectation with respect to the prior. 



ETr9. 



□ 



A. 6 Proof of Theorem [4] 

Proof. It is sufficient to show that Ep^(^g\g^Cz{9^g) < oo and 
first of these and the second will follow analogously. 



^PniOlg) [(^Cz{9,g)] 



< oo. We will prove the 



Epa{e\g)Cnz{9,g) 



Cz{9,g)TT{9)d9 



< gn{R-r) 

<e"(«-'^)eo / \g{x)~tz{x)\dx 



{g{x)-tz{x)) / G" 



fe{x) 



1 T:{9)d9 



dx 



(38) 



where supg ^ D{g, fg) — R < oo and inf^ g -D(<7, /e) = > and ( 38 1 follows from the assumption ( 23 1 and 
the bound J \g{x) — tz{x)\dx < 2. □ 

Since tz{x) can be made to concentrate on regions where r(x) is large, we conjecture that the conditions 
in Theorem |4] are necessary. In fact, this requirement means that the H-posterior influence function will not 
be bounded for a large collection of parametric families. 
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B Proofs of Efficiency 



B.l Proof of Theorem [5] 

We begin with the following Lemma: 
Lemma 3. Let 

then under 43-43 

' \wn{t)\dt ^0 and I \\t\\2\wn{t)\dt ^ 0. 



Proof. We divide the integral into Ai {\\t\\2 > 5y/n} and A2 = {||t|l2 < i^v^}: 



\Wnit)\dt^ / \Wn{t)\dt+ / \Wn{t)\dt (39) 

and show that each vanishes in turn. First, since supggQ \D {gn, fe) — D (g, fg)\ 0, for some e > with 
probability 1 it follows that by Assumption 

3N:yn> N, sup i?(g„, ^,^) - D (g^^Jg ) > -e. 

||t||2><5 " \ "/ 



This now allows us to demonstrate the convergence of the first term in ( 39 ) 



\wn{t)\dt< f ^(^„ + t/yn)e-"''(^-^«"+'^)+"''(^"'^^") 



dt 



J Ai 

< e-"' + TTiOg) (^^^^) ^^(11^112 > VnS) (40) 



where Z is a N{0, 1 {6g)) random variable and (40) converges to zero almost surely. 



We now deal with the second term in (39). Notice that 

{gn, fe^+t/v^) - [g,,, = \t' I^AO'n) 
for 0'^ — On + it/^/n with < 7 < 1 and therefore 

wn{t) = 7r(^„ + t/^/n)e-5*'^" - ^(0g)e-5*'^°(''«)* ^ 

for every t. 

By Assumption A|4]we can choose 6 so that I^{0) >- 2M if jj^ — 6*^112 < 2S for some positive definite 
matrix M where A > B indicates t' At > t'Bt for all t. Since ||6'^ — 0„|| < S with probability 1 for all n 

sufficiently large exp ^— n_D {gn, /§ -^t^/n^ + {on, f§^j^< exp (— jt'A/i) . Therefore 

\wn{t)\dt< ( ^(^„+t/V^)e-5*'^'^*+^(0g) / e-5*^°(e3)*cit <oo. 



A. 
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and the result follows from the pointwise convergence of wit) and the dominated convergence theorem. 
We can prove J ||i||2|u'n(0M^ analogous manner by observing that on Ai 

\\t\\2\Wn{t)\dt< [ ||t||2^(^„+VV^)e'"''('"^^«" + '^)+"''(^"'^^'")dt 

+ [ 7r(0,)||i||2e-5*'^°(e.)tdt 

and on A2, \\t\\2\wn{t)\ and 

\\th\wn{t)\dt < I \\t\\M0n+t/V^)e-'^''^^'' + 7r{9g) f ||i||2e-5*^"(«')*dt < 00. 



□ 



Using this lemma, we prove Theorem [5] 

I {e„ - A iV(O,/^(0<,)), using that / |g„(t) - f0^it)\dt ^ 0, and 



sup \D (g„, fe) - D {g, fe)\ — i- 
see 



Proof. First, from Assumption A 6 
Assumption jA[3| it follows that 



and since 6'„ —4 9g and Assumption A[4| 

D (g„, fg^) ^ D {g, fgj , WgD (g„, /^J ^ Vei^ (.g, fgj , V^i^ (.g„, J ^ V^i? (g, 



{dnjg^ 



nD{gn,fg) 



Now, we write that 7r*^(t) = K^-^TT{6n+t/ ^n) exp 
so that /7r*^(t)di = 1. Let 

From Lemma [3] it follows that / |w„(t)|dt from which 



where k„ is chosen 



and 



lim 



p/2 



dt 



<K-' I M)\dt+ (jY^^ 



p/2 



K,/7r (6lg) 



p/2 



2tt 



That the result holds for I^{9g) replaced with 1^(9^) follows from the almost sure convergence of the latter 



to the former. 



□ 
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B.2 Proof of Theorem |6] 

Proof. Let t — {ti, . . . ,tp), from Theorem [s] 



tiTT* {t\xi, . . . ,Xn) 



a.s. I 



Since 6i; = £:(^„ + t/y^\Xi, . . . , X„) we have 



27r 



p/2 



p/2 



Since (4 - 6ig) 4 iV (O, /^(6'g)) , it follows that (6*; - 6lg) 4 (O, I^iOg)) ; hence 6*; is asymptoti- 
cally normal, efficient as well as robust. □ 



B.3 Efficiency Conditions for Minimum Disparity Estimators 

Here we provide conditions that ensure the consistency and asymptotic normality of the minimum- 
disparity estimator There is a slight variation through the literature in conditions required for efficiency 



(see Beran (1977), Basu et al. (1997), Park and Basu (2004) and 



conditions given below are adapted from Cheng and Vidyashankar 



Cheng and Vidyashankar (2006l). The 



( 2006 ) for the specific case of Hellinger 



distance. A small modification of these conditions will also provide the consistency and asymptotic normality 
for more general disparities under appropriate conditions on G(-). These are in addition to conditions /^j^ 

We first require conditions on the data and the proposed parametric family: 

(Dl) Xi, . . . ,Xn are i.i.d. with distribution given by the density function g{-). 

(D2) The parameter space 8 is locally compact. 

(D3) fg{-) is twice continuously differentiable with respect to 9. 



(D4) 



is continuous and bounded. 



(D5) fg ^(x) ['^efeix)] is continuous and bounded in L2 at 6* = 



(D6) fg ^^■^(x) [Vg/e(a;)] — fg ''^^^{x) ['Vgfg{x)] ['Vgfg{x)]'^ is continuous and bounded in L2 a,t 6 = 9g. 
(D7) fg^^'^(x) ['Vgfg{x)] [Ve/e(a;)]"^ is continuous and bounded in L2 at 9 ~ 9g. 

We also require conditions on the kernel function K in the kernel density estimate and its relationship to 
the parametric density family: 

(Kl) K{-) is symmetric about and J K{t)dt = 1. 
(K2) The bandwidth is chosen so that c„ — > 0, nc\ — > 0, nc„ — >■ 00. 
(K3) There is a sequence a„, n > diverging to infinity such that 
(a) For X a random variable with density fg^i-) 

lim nP {\X\ > an) = 0, 
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(b) 



sup sup sup 

n>l |a;|<a„ teR 



fe (x + ten) 



< oo, 



(c) The parametric score functions have regular central behavior relative to the bandwidth: 

1 r- ^efeM 



and 



lim , / ; dx = 



lim 



= 0, 



(d) The score functions are smooth with respect to K in an L2 sense: 



lim sup / K{t) 



V0/0 (a; + c„i) Ve/e (x) 



fg {x)dx = 0. 



This statement of assumptions in particular remove the condition that K{t) has compact support, which 



was assumed in Beran (1977); Basu et al. (1997); Park and Basu (2004). These assumptions significantly 
expand the class of kernels available for use and hence expands the applicability of Theorem [s] (see ,HookeT| 
and Vidyashankar (2010b) for formal details). In practice, it is numerically more stable to use a Gaussian 



kernel or some other distribution with support on the whole real line and we have used Gaussian kernels 



throughout our numerical experiments. We also follow Cheng and Vidyashankar (2006) in removing the 
assumption of compactness of Q, replacing it with local compactness plus the identifiability condition A[5] 



C Reducing the Kernel Density Dimension 



The conditional disparity formulation outlined above requires the estimation of the density of a response 
conditioned on a potentially high-dimensional set of covariates; this can result in asymptotic bias and poor 
performance in small samples. In this section, we explore two methods for reducing the dimension of the 
conditioning spaces. The first is referred to as the "marginal formulation" and requires only a univariate, 
unconditional, density estimate. This is a Bayesian extension of the approach suggested in [Hooker and 
Vidyashankar (2010b I. It is more stable and computationally eflacient than schemes based on nonparametric 



estimates of conditional densities. However, in a linear- Gaussian model with Gaussian covariates, it requires 
an external specification of variance parameters for identifiability. For this reason, we propose a two-step 
Bayesian estimate. The asymptotic analysis for i.i.d. data can be extended to this approach by using the 



technical ideas in Hooker and Vidyashankar (2010b). 



The second method produces a conditional formulation that relies on the structure of a homoscedastic 
location-scale family P(Yi\Xi,6,a) = f„{y — r]{Xi,9)) and we refer to it as the "conditional-homoscedastic" 
approach. This method provides a full conditional estimate by replacing a non-parametric conditional density 



estimate with a two-step procedure as proposed in Hansen (2004). The method involves first estimating the 



mean function non-parametrically and then estimating a density from the resulting residuals. 



C.l Marginal Formulation 



Hooker and Vidyashankar (2010b) proposed basing inference on a marginal estimation of residual density 



in a nonlinear regression problem. A model of the form 
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is assumed for independent from a scale family with mean zero and variance tr^ . 9 is an unknown parameter 
vector of interest. A disparity method was proposed based on a density estimate of the residuals 



yielding the kernel estimate 

and 9 was estimated by minimizing £'((/)o,i(-), (7!™'' {-,9, a)) where 00,1 is is the postulated density. As described 
above, in a Bayesian context we replace the log likelihood by 

—nD(4>Q^i{-),g'i^\-,9,a)). Here we note that although the estimated of g^\-,9,a) need not have zero 
mean, it is compared to the centered density (/)o,i(') which penalizes parameters for which the residuals are 
not centered. 

This formulation has the advantage of only requiring the estimate of a univariate, unconditional density 
gn^\',9^a). This reduces the computational cost considerably as well as providing a density estimate that 
is more stable in small samples. 



Hooker and Vidyashankar (2010b) proposed a two-step procedure to avoid identifiability problems in a 
frequentist context. This involves replacing u by a robust external estimate a. It was observed that estimates 
of 9 were insensitive to the choice of a. After an estimate 9 was obtained by minimizing D((/)o,i (•) , g^\-, 9, a)), 
an efficient estimate of a was obtained by re-estimating a based on a disparity for the residuals ei{9). A 
similar process can be undertaken here. 

In a Bayesian context a plug-in estimate for ct^ also allows the use of the marginal formulation: an MCMC 
scheme is undertaken with the plug-in value of held fixed. A pseudo-posterior distribution for a can then 
be obtained by plugging in an estimate for to a Disparity- Posterior for a. More explicitly, the following 
scheme can be undertaken: 

1. Perform an MCMC sampling scheme for 9 using a plug- in estimate for cr^. 

2. Approximate the posterior distribution for cr^ with an MCMC scheme to sample from the D-posterior 
P£,(cr2|y) ^ g-nD(g„(-,e,a),0o,i( ))7r(cr2) where 9 is the EDAP estimate calculated above. 

This scheme is not fully Bayesian in the sense that fixed estimators of a and 9 are used in each step above. 



However, we conjecture that, as in Hooker and Vidyashankar (2010b), the two-step procedure will result 



in statistically efficient estimates and asymptotically correct credible regions. We note that while we have 
discussed this formulation with respect to regression problems, it can also be employed with the plug-in 
procedure for random-effects models and we use this in Section |8?2) below. 



The formulation presented here resembles the methods proposed in Pak and Basu (1998) based on a 
sum of disparities between weighted density estimates of the residuals and their expectations assuming the 
parametric model. For particular combinations of kernels and densities, these estimates are efficient, and the 
sum of disparities, appropriately scaled, should also be substitutable for the likelihood in order to achieve an 
alternative D-posterior. 



C.2 Nonparametric Conditional Densities for Regression Models in Location- 
Scale Families 

Under a homoscedastic location-scale model (where the errors are assumed to be i.i.d.) p{Yi\Xi,9^a) = 
fa{Yi — vi^ii ^)) where fa is a distribution with zero mean, an alternative density estimate may be used. We 
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first define a non-parametric estimate of the mean function 



and then a non-parametric estimate of the residual density 



Tin..-, ' ^ 



ncni 



Cn2 



We then consider the disparity between the proposed fg^^ and gn- 



As before, 



Hansen 



D^'^^^ {gn, f) can be substituted for the log likelihood in an MCMC scheme. 

. I (c2) 

( 2004 ) remarks that in the case of a homoscedastic conditional density, gii has smaller bias than 



gn^ . This formulation does not avoid the need to estimate the high-dimensional function m„(a;). However, 
the shift in mean does allow the method to escape the identification problems of the marginal formulation 
while retaining some of its stabilization. 



Online Appendix E.2 gives details of a simulation study of both conditional formulations and the marginal 
formulation above for a regression problem with a three-dimensional covariate. All disparity-based methods 
perform similarly to using the posterior with the exception of the conditional form in Sect ion [6] when Hellinger 
distance is used which demonstrates a substantial increase in variance. We speculate that this is due to the 
sparsity of the data in high dimensions creating inliers; negative exponential disparity is less sensitive to this 



problem (Basu et al. 1997). 



D Computational Considerations and Implementation 

Our experience is that the computational cost of employing Disparity-based methods as proposed above is 
comparable to employing an MCMC scheme for the equivalent likelihood and generally requires an increase 
in computation time by a factor of between 2 and 10. Further, the comparative advantage of employing 
estimates ([s]) versus ^ depends on the context that is used. 

Formally, we assume TV Monte Carlo samples is ^ and M Gauss-Hermite quadrature points in ^ where 
typically M < N. In this case, the cost of evaluating gn{zi) in ^ is 0{nN), but this may be pre-computed 
before employing MCMC, and the cost of evaluating ^ for a new value of 9 is 0{N). In comparison, the 
use of (|6| requires the evaluation of gn{^i{0)) at each iteration at a 0{nM) each evaluation. 

Within the context of conditional disparity metrics, we assume N Monte Carlo points used for each Xi 
in the equivalent verion of ([s]) for ( 26 1 and note that in this context N can be reduced due to the additional 
averaging over the Xi. The cost of evaluating gn\zj\Xi) from (25) for all Zj and Xi is 0{n'^N) for ([5| and 
0{n^M) for Here again the computation can be carried out before MCMC is employed for ([s]), requiring 
0{nN) operations. In ([6]) the denominator of (25) can be pre-computed, reducing the computational cost of 
each iteration to 0{nM); however, in this case we will not necessarily expect M < N. Similar calculations 
apply to estimates based on gn"^^ . 

For marginal disparities gli^^ in (41) changes for each 9, requiring 0{nM) calculations to evaluate 
Successful use of ^ would require the Zi to vary smoothly with 9 and would also require the re-evaluation 
of g^\zi) at a cost of 0{nN) each iteration. Within the context of hierarchical models above, g„ varies 
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with latent variables and this the use of ([5]) will generally be more computationally efficient. The cost of 
evaluating the likelihood is always 0{n). 

While these calculations provide general guidelines to computational costs, the relative efficiency of ^ 
and ^ strongly depends on the implementation of the procedure. Our simulations have been carried out in 
the R programming environment where we have found ([5| to be computationally cheaper anywhere it can be 
employed. However, this will be platform-dependent - changing with what routines are given in pre-compiled 
code, for example - and will also depend strongly on the details of our implementation. 



E Simulation Studies 

While we have established attractive asymptotic properties for these estimators their finite sample prop- 
erties remain an important source of investigation. Since these estimates are based on nonparametric density 
estimation, we may suspect that they require large sample sizes before their asymptotic properties become 
apparent. In fact, our studies below demonstrate good performance even for small sample sizes. 



E.l Gaussian and exponential-Gamma Distributions — The i.i.d. Case 

We undertook a simulation study for the normal mean example in Figure [T] to examine the efhciency and 
robustness of Hellinger and Negative-Exponential posterior samples. 1,000 sample data sets of size 20 from a 
A^(5, 1) population were generated. For each sample data set, a random walk Metropolis algorithm was run 
for 20,000 steps using a A''(0, 0.5) proposal distribution and a A^(0, 25) prior, placing the true mean one prior 
standard deviation above the prior mean. The kernel bandwidth was selected by the bandwidth selection in 
Sheather and Jones ( 1991|). H- and N-post eriors were easily calculated by combining the KernSmooth (original 



by Matt Wand. R port by Brian Ripley., 20091 and LearnBayes (Albert 20081 packages in R. Expected a 
posteriori estimates for the sample mean were obtained along with 95% credible intervals from every second 
sample in the second half of the MCMC chain. Outlier contamination was investigated by reducing the last 
one, two or five elements in the data set by 3, 5 or 10. This choice was made so that both outliers and prior 
influence the EDAP in the same direction. The analytic posterior without the outliers is normal with mean 
4.99 (equivalently, bias of -0.01) and standard deviation 0.223. The results of this simulation are summarized 
in Table [3j As can be expected, the standard Bayesian posterior suffers from sensitivity to large negative 
values whereas the disparity-based methods remain nearly unchanged. Near-outliers at the smaller value of 
-3 resulted in similar biases across all methods. A comparison of CPU time indicates that the use of disparity 
methods required a little more than twice the computational effort as compared to using the likelihood within 
an MCMC method. 

In order to provide a comparison with classical robust methods, we conducted a comparison with an 
estimate based on a Huberized mean. Table |4] provides the results of these estimates using the Huber loss as 
a log likelihood within an MCMC scheme and by the minimizer of the Huber loss. In both cases, we tried 
cut-points between quadratic and linear costs at the normal 0.8 and 0.99 critical values. We also considered 
replacing the log likelihood within the MCMC scheme with a loss corresponding to Tukey's biweight function 
with cut point 4.685; this has the appeal of being similar to disparity methods in both having a re-descending 
influence function and tails of the likelihood that do not go to zero at infinity. We notice here that Huberized 
methods with cut point at the 0.8 critical value and Tukey's biweight method both showed noticeable increases 
in standard deviation compared to Hellinger methods and the likelihood when no outliers were added to the 
data. The negative exponential disparity had a similar increase in variance in this case which we speculate 
is due to having relatively heavier tails than Hellinger distance and therefore greater influence from the prior 
(the maximal value of NED is e^^ where as that of 2HD is 4). While Huber both methods exhibited less 
bias due to outliers than the likelihood, the bias still grew substantially with large outliers. By contrast, 
both biweight and disparity methods demonstrate redescending influence as outliers are placed at more 
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No Outliers 





Bias SD Coverage Length CPU Time 


Posterior 
Hellinger 

Negative Exponential 


-0.015 0.222 0.956 0.873 3.393 
-0.015 0.225 0.954 0.920 7.669 
-0.018 0.229 0.973 1.022 7.731 



Outliers 





1 Outlier 2 Outliers 5 Outliers 


Loc 


Bias SD Coverage 


Bias SD Coverage 


Bias SD Coverage 


Posterior -3 
-5 
-10 


-0.164 0.219 0.883 
-0.264 0.219 0.778 
-0.513 0.219 0.360 


-0.300 0.206 0.722 
-0.490 0.206 0.375 
-0.965 0.207 0.004 


-0.637 0.182 0.100 
-1.053 0.182 0.001 
-2.093 0.182 0.000 


HD -3 
-5 
-10 


-0.109 0.246 0.920 
-0.027 0.238 0.942 
-0.014 0.234 0.948 


-0.194 0.275 0.859 
-0.040 0.257 0.928 
-0.019 0.249 0.935 


-0.237 0.299 0.770 
-0.024 0.305 0.865 
0.018 0.286 0.883 


NED -3 
-5 
-10 


-0.080 0.256 0.959 
-0.020 0.238 0.977 
-0.017 0.237 0.973 


-0.133 0.279 0.933 
-0.025 0.243 0.968 
-0.020 0.241 0.970 


-0.166 0.308 0.893 
-0.015 0.264 0.948 
-0.007 0.260 0.952 



Table 3: A simulation study for a normal mean using the usual posterior, the Hellinger posterior and the 
Negative Exponential posterior. Columns in the first table give the bias and variance of the posterior mean, 
coverage and average CPU time of the central 95% credible interval based on 1,000 simulations. The second 
table provides results for 1, 2, and 5 outliers (large columns) located at -3, -5, -10 (column Loc) for the 
posterior, Hellinger distance (HD) and negative exponential disparity (NED). 

extreme values, but we observe that the biweight estimates were more strongly affected for larger numbers 
of contaminated observations. Disparity-based methods had credible intervals between 5% and 15% wider 
than those based on the likelihood, but these were smaller than those for Huberized and biweight methods 
when a prior was employed. The computing times for MCMC methods using Huber and biweight likelihoods 
are somewhat greater than for the use of the true likelihood, although computing the minimizing value is 
unsurprisingly faster. 

The problem of estimating a Gaussian mean is made relatively straightforward by the symmetry of the 
distribution. We therefore conducted a further study, estimating both shape and scale parameters in an 
exponential-gamma distribution (ie, exp(Xi) has a Gamma distribution, and the domain of the distribution 
is the whole real line). In this case, the shape parameter was chosen at 5 and the scale parameter at 0.25 and 
these were given priors with degrees of freedom 3 and 0.3 respectively. 5,000 data sets were simulated of 20 
points each and the D-posteriors were calculated as above both with and without an outlier placed at ln(20). 
For this chain a random walk Metropolis algorithm was again conducted with the random walk on the log 
shape and log scale parameters again using the LearnBayes package. Table [5] reports the tabulated results. 
We note in particular that the efficiency of the H-posterior has been considerably reduced, as have its coverage 
properties; additionally two simulation runs were removed due to poor convergence. This is explained as being 
due to the inlier effect; the skewness of the gamma distribution produces density estimates Qn that tend to 
have "holes" and the Hellinger disparity is sensitive to these; an example is give in Figure [4j By contrast, 
the negative exponential disparity is much less sensitive. 
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No Outliers 





Bias 


SD 


Coverage 


Length 


CPU Time 


Bayes 80 


-0.017 


0.229 


0.973 


0.978 


5.337 


Bayes 99 


-0.015 


0.223 


0.956 


0.877 


4.962 


Freq 80 


-0.004 


0.229 


0.948 


0.894 


0.002 


Freq 99 


-0.004 


0.223 


0.960 


0.894 


0.001 


Biweight 


-0.017 


0.228 


0.977 


1.007 


3.523 



Outliers 





1 Outher 2 OutUers 5 Outhcrs 


Loc 


Bias SD Coverage 


Bias SD Coverage 


Bias SD Coverage 


Bayes 3 
80 5 
10 


-0.102 0.237 0.948 
-0.101 0.237 0.950 
-0.101 0.237 0.946 


-0.188 0.235 0.904 
-0.188 0.235 0.907 
-0.188 0.236 0.907 


-0.450 0.241 0.584 
-0.451 0.241 0.582 
-0.450 0.241 0.574 


Bayes 3 
99 5 
10 


-0.149 0.228 0.894 
-0.151 0.231 0.893 
-0.151 0.231 0.887 


-0.282 0.221 0.757 
-0.290 0.228 0.754 
-0.290 0.228 0.760 


-0.635 0.192 0.153 
-0.704 0.231 0.147 
-0.704 0.231 0.148 


Freq 3 
80 5 
10 


-0.087 0.238 0.922 
-0.087 0.238 0.922 
-0.087 0.238 0.922 


-0.172 0.236 0.872 
-0.172 0.236 0.872 
-0.172 0.236 0.872 


-0.429 0.242 0.454 
-0.429 0.242 0.454 
-0.429 0.242 0.454 


Freq 3 
99 5 
10 


-0.140 0.230 0.898 
-0.140 0.231 0.897 
-0.140 0.231 0.897 


-0.275 0.223 0.753 
-0.279 0.229 0.751 
-0.279 0.229 0.751 


-0.633 0.188 0.115 
-0.693 0.231 0.115 
-0.693 0.231 0.115 


Biweight 3 
5 

10 


-0.091 0.246 0.954 
-0.018 0.237 0.974 
-0.017 0.236 0.977 


-0.175 0.252 0.915 
-0.019 0.236 0.972 
-0.018 0.234 0.971 


-0.443 0.275 0.645 
-0.022 0.243 0.967 
-0.018 0.236 0.969 



Table 4: Simulation results corresponding to Table|3]using Huberized and Tukey Biweight estimates Bayes 80 
and Bayes 99 correspond to replacing the log posterior with a Huber loss function using cut points at the 0.8 
and 0.99 critical values for a normal distribution. Freq 80 and Freq 99 correspond to obtaining estimators by 
minimizing the corresponding Huber loss functions. In this case, coverage corresponds to confidence intervals 
calculated as the estimator plus or minus two estimated standard errors. Biweight correspond to replacing 
the log posterior with Tukey's biweight loss with cutpoint 4.685. 
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No Outliers 






Shape 






Scale 






Bias 


Variance 


Coverage 


Bias 


Variance 


Coverage 


Posterior 


-0.088 


0.571 


0.9516 


0.004 


0.0005 


0.9562 


Hellinger 


-0.081 


1.005 


0.850 


0.010 


0.0011 


0.839 


Negative Exponential 


-0.182 


0.739 


0.9436 


0.008 


0.0007 


0.9556 


Outlier at 20 






Shape 






Scale 






Bias 


Variance 


Coverage 


Bias 


Variance 


Coverage 


Posterior 


-3.068 


0.001 





0.988 


0.00006 





Hellinger 


-0.013 


1.046 


0.8508 


0.010 


0.0011 


0.8440 


Negative Exponential 


-0.210 


0.725 


0.9496 


0.010 


0.0008 


0.9586 



Table 5: A simulation study for an exponential-gamma using the usual posterior, the Hellinger posterior 
and the Negative Exponential posterior. Columns give the bias and variance of the posterior mean and the 
coverage of the central 95% credible interval based on 5000 simulations. Note that the same data sets are 
used in for both tables, an outlier being added to the data sets when calculating the quantities in the bottom 
table. The shape parameter is given in the first column and the scale parameter in the second in each entry. 




Figure 4: Comparisons of gn{x) (dashed) and the true exponential-r(5, 0.25) density generating the data 
(solid). Hellinger distance estimators are sensitive to valleys in the density that are due to a density estimate 
naively applied to skewed data. Negative exponential disparity are robust towards these effects as well. 
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beta 


beta 1 


beta 2 


beta, 3 


time 


Likelihood 

HD 

NED 


L030 (0.137) 

n Q74 fn 1 

0.926 (0.136) 


0.996 (0.186) 
1 noo (() 21 3^ 
0.997 (0.201) 


0.993 (0.194) 
993 (n IM) 
0.994 (0.206) 


0.998 (0.217) 
994 (0 239^ 
1.001 (0.224) 


1.006 (0.301) 
980 (n 301 \ 
0.964 (0.294) 


20.251 
62.315 
70.812 


HD horn 
NED horn 


0.881 (0.181) 
0.829 (0.157) 


0.995 (0.338) 
0.993 (0.235) 


0.987 (0.198) 
0.988 (0.211) 


0.995 (0.217) 
0.996 (0.228) 


0.980 (0.296) 
0.974 (0.313) 


62.184 
70.649 


HD marg 
NED marg 


1.062 (0.133) 
1.087 (0.137) 


0.997 (0.193) 
1.000 (0.199) 


0.994 (0.203) 
0.994 (0.210) 


0.997 (0.225) 
0.998 (0.232) 


1.003 (0.309) 
1.005 (0.321) 


102.959 
102.966 


Huber EAP 
Huber min 


0.878 (0.123) 
0.919 (0.207) 


0.998 (0.192) 
0.999 (0.191) 


0.994 (0.198) 
0.995 (0.198) 


1.000 (0.221) 
1.000 (0.222) 


1.005 (0.306) 

1.006 (0.307) 


34.861 
0.013 



Table 6: Simulation results for linear regression. Columns give mean and standard deviation of EDAP 
estimates of parameters and CPU time based on 1,000 simulated data sets. Rows correspond to posteriors 
using Likelihood, using conditional density estimates with Hellinger distance (HD) and negative exponential 
disparity (NED), using conditional- homoscedastic density estimates with Hellinger distance (HD hom) and 
negative exponential disparity (NED hom), using the marginal formulation of Hellinger distance (HD marg) 
and negative exponential disparity (NED marg) and using a Huber loss scaled by a in place of the log 
likelihood (Huber EAP) and minimizing the Huber loss (Huber min). In both the later case the cut point 
was chosen at the 0.8 critical value of a standard normal. 



E.2 Linear Regression Models 

Figure [5] provides example D-posterior distributions of all regression disparities described in Sections [6] 
and [C] along with the posterior for an example 3-dimensional linear regression based on 30 points. Both 
Hellinger and negative exponential disparities were used. Covariates were generated from a standard normal 
distribution with errors also generated from a standard normal distribution. The likelihood is noticeably 
more concentrated than the disparity-based posteriors, but all exhibit broadly similar shapes. 

We have supplemented this experiment with a simulation. 1,000 data sets were simulated from a linear 
regression model on three covariates. The covariates were chosen from a three dimensional normal distribu- 
tion with unit marginal variances and correlation 0.5 between each pair of covariates. These were held fixed 
across all simulations. The parameters in the model were chosen as (/3o, (32, /^3, f^) = (1, 1, 1, 1, 1). Band- 



widths were chosen using the criterion in Sheather and Jones ( 1991 ) based on the observed covariates and 



maximum likelihood estimates of the residuals. Both Hellinger distance and negative exponential disparity 
were considered, and the conditional formulation, marginal formulation and conditional-homoscedastic form 
were used to estimate the five parameters in the model. Additionally, we obtained Huberized estimators by 
replacing the log likelihood with a Huber loss function and minimizing the Huber loss function. This resulted 
in seven estimators including a Gaussian posterior. For each estimate a random-walk Metropolis algorithm 
was run for 10,000 steps and EDAP estimates were calculated from every fifth sample in the second half of 
the chain. The results from this study are given in Table |6] Here we observe good agreement between all 
disparity-based methods and the likelihood. The choice between these will therefore depend on the amount 
of data available and the dimension of the covariate space. Computing time is reported in this simulation 
as well where disparity methods increase computing costs by a factor of between 3 and 5 while the use of a 
Huber loss within MCMC corresponds to about a 1.5 factor increase in cost. We note, however, that relative 
computing costs in interpreted languages such as R are strongly dependent on the details of implementation. 



E.3 One- Way Random Effects Models 

Figure [6] demonstrates the differences resulting from robustifying different distributional assumptions in 
the one-way random effects model described in Section [7] We simulated a set of five groups of twenty 
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Figure 5: D-posterior inference and linear regression. Top left: posterior for /?o, bottom left: posterior for a. 

(c) 

Right: posterior for each Pi, i — I, ... ,3. Thick lines: posterior based on likelihood. Solid lines: based on gn , 
dashed: based on gn , dotted: marginal formulation. Black: Hellinger distance, grey: negative exponential 
disparity. Here 'Conditional 1' indicates the conditional density estimate in Section [6] 'Conditional 2' refers 
to the conditional-homoscedastic approach. 
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No Outliers 




M 


sd(^) 


coverage 


(T 


sd(fT) 


coverage 


T 


sd(T) 


coverage 


Likelihood 


0.0013 


0.286 


0.947 


0.200 


0.0230 


0.940 


0.979 


0.234 


0.919 


HD - 


obs 


0.0035 


0.285 


0.937 


0.200 


0.0231 


0.939 


1.070 


0.342 


0.900 


HD - 


rand 


0.0021 


0.286 


0.936 


0.191 


0.0259 


0.686 


0.982 


0.231 


0.928 


HD - 


both 


0.0024 


0.291 


0.933 


0.191 


0.0258 


0.692 


1.068 


0.342 


0.899 


Outlying Random Effect 




M 


sd(/x) 


coverage 




sd{a) 


coverage 


T 


sd(r) 


coverage 


Likelihood 


0.365 


0.197 


1.000 


0.200 


0.0222 


0.930 


10.11 


0.163 


0.000 


HD - 


obs 


0.002 


0.288 


0.926 


0.200 


0.0223 


0.922 


1.088 


0.389 


0.887 


HD - 


rand 


0.353 


0.252 


1.000 


0.191 


0.0249 


0.684 


10.13 


0.196 


0.000 


HD - 


both 


0.002 


0.295 


0.917 


0.191 


0.0249 


0.674 


1.066 


0.325 


0.885 



Table 7: A simulation study for a one-way random effect model from using the posterior, replacing the 
observation likelihood with a conditional Hellinger distance, replacing the random effect likelihood with 
Hellinger distance and making both replacements. The columns give the mean, standard deviation and 
coverage of /i, and based on 1,000 simulations. The lower table indicates the effect of adding an 
outlying random effect at 40. A total of 8 simulations were excluded due to poor convergence of the MCMC 
chain. 



observations. Each group had variance 0.2 with a mean drawn from a N{^, 1) population. This produces 
a random-effects model and the goal is to estimate /i. The plots in Figure [6] show that the use of disparity 
methods in either the random effect or on the residual process or on both provide very similar distributions 
to the correct posterior. When an outlying group is added with mean 40, those methods that replace the 
random effects distribution with a disparity are unaffected while those that do not are substantially biased. 

In order to verify the apparent success of this method, we conducted a simulation study of a one-way 
random effects model with ten random effects and five observations per random effect. The random effects 
were simulated from a standard normal distribution, while the observations were Gaussian, centered on the 
random effect and with standard deviation 0.2. We simulated 1,000 versions of these data. For each version 
a random-walk Metropolis scheme was run to sample from the posterior, the posterior with the observation 
likelihood replaced by H-posterior, the posterior with the random effect likelihood replaced by H-posterior 
and the posterior with both replacements. All MCMC schemes were run for 10,000 steps with posterior 
distributions calculated based on every 5th sample from the second half of the chain. We additionally 
added a further random effect with five observations distributed around the value 40 with standard deviation 



0.2. Bandwidth parameters were chosen by the selection criterion of Sheather and Jones (19911 based on 
maximum-likelihood estimates of random effects and residual errors. The results of this simulation are 
summarized in Table [T] Here we see that the estimation of a is biased downwards by the estimation of a 
conditional density for each random effect, based on only five observations and there is more uncertainty in 
the estimate of r when Hellinger distance is used in place of the random effect log likelihood. The disparity- 
based methods otherwise perform very similarly to the true likelihood. When an outlying random effect 
is added, replacing the random effect likelihood with Hellinger distance robustified inference, where those 
posteriors without this replacement were severely biased. 



F Data 

Table [8] provides the values of the parasitology data set used in Section |8.1| The data used for the class 
survey data in Section |8.2| are provided in Table |9] they are graphed in Figure [Tj 
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Random Effect fflean 



Random Effect Mean 




-2-1012 -2-10123 



Observation Variance Observation Variance 




0.16 0.18 0.20 0.22 0.24 0.26 0.16 0.18 0.20 0.22 0.24 



Random Effect Variance Random Effect Variance 




012345 5 10 15 



Figure 6: Posterior distributions for a random-effects model. 5 groups of 20 observations each with, mean 
drawn from N{fi, 1) and variance 0.2. /i was given a N{0, 1) prior. Posterior densities for fi were estimated 
from every 50th observation in the last half of a random-walk Markov chain run for 100,000 steps. Hellinger 
distance was used to replace the component of the likelihood representing Yij — Xi, representing Xi and both. 
Left: posterior densities for /i from the original data when the model is correct. Right: posterior densities 
after adding a further group of 5 observation generated from iV(40,0.2). First row: estimates for /z, second: 
0-2, third: r^. 



Horse 


1 


2 


3 


4 


5 


6 


7 


Pre-treatment 


2440 


1000 


1900 


1820 


3260 


300 


660 


Post-treatment 


580 


320 


400 


160 


60 


40 


120 



Table 8: Data used in Section [8?T| pre- and post-treatment fecal egg count for seven horses on one farm. 
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Figure 7: Responses to an in-class survey on expected income at ages 35, 45, 55 and 65. Students were either 
foreign born (dashed) or American (sohd). 
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Table 9: Class survey data used in Section 8.2 columns give American (a) or foreign (f) status, and log 
expected salary at ages 35, 45, 55, and 65. 

F.l Random Slope Models for Class Survey Data 

A further model exploration of the class survey data allows a random slope for each student in addition 
to the random offset. The model now becomes 

yijk = boij + biijtk + €ijk (42) 
with additional distributional assumptions 

and an additional term 



added to (33). Here, this term can be robustified in a similar manner to the robustification of the boij. 
However, we note that a robustification of the error terms would require the estimation of a conditional 
density for each ij - based on only four data points. We viewed this as being too little to achieve reasonable 
results and therefore employed the marginal formulation described in Section [Cj Specifically, we first obtained 
residuals Cijk for the random slope model from the maximum likelihood estimates for each subject-specific 
effect and estimated ^ 

^ 67472 *'''^"^ " median(e)|. 
Following this, we estimated a combined density for all residuals, conditional on the random effects 
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and replaced the first term in (33) with —8nD{gn^^\-; (3), (f)f)^^2(-)). Following the estimation of all other 
parameters, we obtained new residuals Cijk = Yijk — boij — hujtk where the 5oy and huj are the EDAP 
estimators. We then re estimated cr^ based on its H-posterior using the Cijk as data. In this particular case 
a large number of outliers from a concentrated peak (see Figure |8| meant that the use of Gauss-Hermite 
quadrature in the evaluation of 

suffered from large numerical errors and we therefore employed a Monte Carlo integral based on 400 data 
points drawn from gi™'' instead, using the estimate ([5]). To estimate both and the other parameters we 
used a Metropolis random walk algorithm which was again run for 200,000 iterations with estimates based 
on every 100th sample in the second half of the chain. 

Some results from this analysis are displayed in FigurejSj The residual distribution of the Cijk show a very 
strong peak and a number of isolated outliers. The estimated standard deviation of the residual distribution 
is therefore very different between those methods that are robust to outliers and those that are not; the mean 
posterior a was increased by a factor of four between those methods using a Hellinger disparity and those 
using the random effect log likelihood. The random slope variance was estimated to be small by all methods 
- we speculate that the distinction between random effect log likelihoods and Hellinger methods is bias due 
to bandwidth size - but this was not enough to overcome the differences between the methods concerning 
the distinction between j5fi and j5ai- 
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Figure 8: Analysis of a random-slope random-intercept model for the class survey data. Top left: a density 
estimate of the errors following a two-step procedure with the error variance held constant. This shows 
numerous isolated outliers than create an ill-conditioned problem for Gauss-Hermite quadrature methods. 
Top right: estimates of residual standard deviation replacing various terms in the log likelihood with Bellinger 
distance. The effect of outliers is clearly apparent in producing an over-estimate of variance. Bottom left: 
estimated variance of the random slope. Bottom right: estimated difference in mean slope between American 
and foreign students. 
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